Spark Development Examples
Contents
- Spark Development Examples
- 1: Development Preparation
- 2: Core Spark Programming Components
- 2.1: Context
- 2.2: Master URL Configuration
- 3: Spark Core: WordCount
- 4: Spark SQL + Hive
- 4.1: Introduction
- 5: Spark Streaming
- 6: Maven Dependencies
1: Development Preparation
The Windows environments for Java, Hadoop, Scala, and Maven have all been set up and verified.
2: Core Spark Programming Components
2.1: Context
In Spark, every programming entry point is some kind of Context:
- Spark Core: the entry point is SparkContext; on the Java side it is JavaSparkContext, on the Scala side SparkContext.
- Spark SQL: before Spark 2.0 the entry point was SQLContext or HiveContext; from Spark 2.0 on, the unified SparkSession API is used.
- Spark Streaming: the entry point is StreamingContext.
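As a minimal hedged sketch (the object name, application name, and batch interval below are arbitrary examples), the three entry points are created roughly as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EntryPointDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EntryPointDemo").setMaster("local[*]")

    // Spark Core: SparkContext (JavaSparkContext on the Java side)
    val sc = new SparkContext(conf)

    // Spark SQL since 2.0: SparkSession (replaces SQLContext/HiveContext);
    // getOrCreate() reuses the SparkContext created above
    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Spark Streaming: StreamingContext, here built on the existing SparkContext
    // with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    ssc.stop(stopSparkContext = true)
  }
}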
2.2: Master URL Configuration
Note: the Master URL determines how and where a Spark job runs.
Local mode (local): the job runs locally (the Spark Driver and the Executors all run on the local machine)
local[M]: allocates M worker threads to the job
local[M, N]: allocates M worker threads and allows up to N task failures before the job is aborted
1: Spark's own standalone cluster
Format: spark://<masterIp>:<masterPort>
Example: spark://bigdata01:7077
HA format: spark://<masterIp1>:<masterPort1>,<masterIp2>:<masterPort2>,...
Example: spark://bigdata01:7077,bigdata02:7077
2: YARN (the mainstream choice in China)
cluster: the SparkContext/Driver is created inside the YARN cluster
client: the SparkContext/Driver is created locally (that is, on the machine that submits the job)
3: Mesos (more common in Europe and the US)
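A minimal hedged sketch of how these Master URLs are passed to a job (the host names follow the examples above; in a real job only one setMaster call would be used, since each call overrides the previous one):

import org.apache.spark.SparkConf

object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MasterUrlDemo")

    // Local mode: Driver and Executors run in the submitting JVM
    conf.setMaster("local")        // one worker thread
    conf.setMaster("local[4]")     // 4 worker threads
    conf.setMaster("local[4, 3]")  // 4 worker threads, up to 3 task failures allowed
    conf.setMaster("local[*]")     // one worker thread per available core

    // Standalone cluster (the HA form lists every master)
    conf.setMaster("spark://bigdata01:7077")
    conf.setMaster("spark://bigdata01:7077,bigdata02:7077")

    // On YARN the master is normally given on the command line rather than in code:
    //   spark-submit --master yarn --deploy-mode cluster ...  (Driver created inside the cluster)
    //   spark-submit --master yarn --deploy-mode client  ...  (Driver created on the submitting machine)

    // Print whichever master URL ended up being set
    println(conf.get("spark.master"))
  }
}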
3: Spark Core: WordCount
First, put the cluster's hdfs-site.xml and core-site.xml files under the resources directory.
Raw data: data1.txt
1 Java bigdata
2 Java bigdata
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Recommended way to approach the program: work backwards from the result and fill in the blanks
 * (sketch the overall logical structure first).
 * Steps:
 * 1. Build a SparkContext
 *    - SparkContext requires a SparkConf
 *    - the conf must specify a master URL
 *    - the conf must specify an application name
 * 2. Load external data into an RDD
 * 3. Transform the RDD according to the business logic
 * 4. Trigger the job with an action
 * 5. Release resources
 */
object SparkScalaWordCount {
  def main(args: Array[String]): Unit = {
    // Reduce log noise from Hadoop and Spark
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.spark-project").setLevel(Level.WARN)

    val conf = new SparkConf()
    // Configure the application name and master URL
    conf.setAppName(s"${SparkScalaWordCount.getClass.getSimpleName}")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load a local file
    val lines: RDD[String] = sc.textFile("file:///C:/data1.txt")
    // Load from HDFS instead:
    // val lines: RDD[String] = sc.textFile("hdfs://ns1/data/spark/streaming/test/hello.txt")
    println("partition: " + lines.getNumPartitions)

    val words: RDD[String] = lines.flatMap(line => line.split("\\s+"))
    val pairs: RDD[(String, Int)] = words.map(word => (word, 1))
    val ret: RDD[(String, Int)] = pairs.reduceByKey((v1, v2) => v1 + v2)
    ret.foreach { case (word, count) => println(word + "--->" + count) }

    sc.stop()
  }
}
Result:
Java--->2
1--->1
2--->1
bigdata--->2
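If the counts should be persisted rather than just printed, a small hedged variation on the example above (reusing the pair RDD ret; the output path is a placeholder) sorts by count and writes the result out as text:

// Sort by count descending and write the result out; the path is a placeholder
ret.sortBy({ case (_, count) => count }, ascending = false)
  .map { case (word, count) => s"$word\t$count" }
  .saveAsTextFile("file:///C:/wordcount-output")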
4: Spark SQL + Hive
4.1: Introduction
Prerequisite: the relevant configuration files (for example hive-site.xml) have been prepared.
SQL statement: a two-table join
select i.name, b.age, b.married, b.children, i.height from teacher_info i left join teacher_basic b on i.name = b.name
Code
import org.apache.spark.sql.SparkSession

object SparkSQLLoadAndSave {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSQLLoadAndSave")
      .enableHiveSupport() // enables Hive support (HiveQL, Hive metastore, warehouse tables)
      .getOrCreate()

    // Create a database
    spark.sql("create database db_1810")

    val createInfoSQL =
      """
        |create table `db_1810`.`teacher_info` (name string, height int) row format delimited
        |fields terminated by ','
      """.stripMargin
    spark.sql(createInfoSQL)

    val createBasicSQL =
      """
        |create table `db_1810`.`teacher_basic` (name string, age int, married boolean, children int) row format delimited
        |fields terminated by ','
      """.stripMargin
    spark.sql(createBasicSQL)

    // Load the data files into the two tables
    val loadInfoSQL = "load data inpath 'hdfs://ns1/input/spark/teacher_info.txt' into table `db_1810`.`teacher_info`"
    val loadBasicSQL = "load data inpath 'hdfs://ns1/input/spark/teacher_basic.txt' into table `db_1810`.`teacher_basic`"
    spark.sql(loadInfoSQL)
    spark.sql(loadBasicSQL)

    // Run the join query
    val sql =
      """
        |select i.name, b.age, b.married, b.children, i.height
        |from `db_1810`.`teacher_info` i left join `db_1810`.`teacher_basic` b on i.name = b.name
      """.stripMargin
    val joinedDF = spark.sql(sql)

    // Persist the result into the Hive table teacher
    joinedDF.write.saveAsTable("`db_1810`.`teacher`")

    spark.stop()
  }
}
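Once this job has run, the joined result is stored in the Hive table db_1810.teacher. As a small hedged follow-up (reusing the spark session from the example above), it can be read back and inspected like any other table:

// Read the persisted join result back from Hive and inspect it
val teacherDF = spark.table("db_1810.teacher")
teacherDF.printSchema()
teacherDF.show(20, truncate = false)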
5: Spark Streaming
Spark Streaming is most often used together with Kafka.
This example reads from Kafka using the direct approach.
import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingWithDirectKafkaOps {
  def main(args: Array[String]): Unit = {
    // Reduce log noise from Spark and Hadoop
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.spark-project").setLevel(Level.WARN)

    if (args == null || args.length < 3) {
      println(
        """Parameter Errors! Usage: <batchInterval> <groupId> <topicList>
          |batchInterval : interval between batch submissions (in seconds)
          |groupId       : consumer group id
          |topicList     : comma-separated list of topics to consume
        """.stripMargin)
      System.exit(-1)
    }
    val Array(batchInterval, groupId, topicList) = args

    val conf = new SparkConf()
      .setAppName("SparkStreamingWithDirectKafkaOps")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(batchInterval.toLong))

    val kafkaParams = Map[String, String](
      "bootstrap.servers" -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
      "group.id" -> groupId,
      // "largest" starts reading from the latest offset,
      // "smallest" starts reading from the earliest offset
      "auto.offset.reset" -> "smallest"
    )
    val topics = topicList.split(",").toSet

    // Read from Kafka using the direct approach
    val kafkaDStream: InputDStream[(String, String)] = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

    kafkaDStream.foreachRDD((rdd, bTime) => {
      if (!rdd.isEmpty()) {
        println(s"Time: $bTime")
        rdd.foreach { case (key, value) => println(value) }
        println("Offset ranges:")
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        for (offsetRange <- offsetRanges) {
          val topic = offsetRange.topic
          val partition = offsetRange.partition
          val fromOffset = offsetRange.fromOffset
          val untilOffset = offsetRange.untilOffset
          val count = offsetRange.count()
          println(s"topic:${topic}, partition:${partition}, " +
            s"fromOffset:${fromOffset}, untilOffset:${untilOffset}, count:${count}")
        }
      }
    })

    ssc.start()
    // Keep the streaming application running
    ssc.awaitTermination()
  }
}
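The foreachRDD block above only prints the messages and their offset ranges. As a minimal hedged sketch (reusing kafkaDStream from the example above, and declared before ssc.start() is called), the same direct stream can also be fed into ordinary DStream transformations, for example a per-batch word count over the message values:

// Per-batch word count over the Kafka message values
kafkaDStream
  .map { case (_, value) => value }  // drop the key, keep the message value
  .flatMap(_.split("\\s+"))          // split each message into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)                // count words within each batch
  .print()                           // print the first results of every batch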
6: Maven Dependencies
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.2</spark.version>
    <hadoop.version>2.6.4</hadoop.version>
</properties>
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.39</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- Spark SQL -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL and Hive integration -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Scala with JDBC -->
    <dependency>
        <groupId>org.scalikejdbc</groupId>
        <artifactId>scalikejdbc_2.11</artifactId>
        <version>3.2.0</version>
    </dependency>
</dependencies>