Spark Development Examples


Table of Contents

  • Spark Development Examples
  • 1: Development Preparation
  • 2: Core Components of Spark Programming
  • 2.1: Context
  • 2.2: Master URL Configuration
  • 3: Spark Core: WordCount
  • 4: Spark SQL + Hive
  • 4.1: Basic Introduction
  • 5: Spark Streaming
  • 6: Maven Dependencies


1: Development Preparation

The Windows environments for Java, Hadoop, Scala, and Maven have all been configured and verified.


2: Core Components of Spark Programming

2.1: Context

In Spark, every programming entry point is some kind of Context:

  • Spark Core: the entry point is SparkContext. It comes in a Java and a Scala flavor: JavaSparkContext in Java, SparkContext in Scala.
  • Spark SQL: before Spark 2.0, the entry point was SQLContext or HiveContext; from Spark 2.0 on, the unified API is SparkSession.
  • Spark Streaming: the entry point is StreamingContext.

Creating each of these entry points is shown in the sketch below.
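As a minimal sketch (assuming Spark 2.x; the application name, master URL, and 5-second batch interval are placeholders), the three entry points can be created like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextDemo {
    def main(args: Array[String]): Unit = {
        // Spark Core entry point
        val conf = new SparkConf().setAppName("ContextDemo").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Spark SQL entry point (Spark 2.0+); getOrCreate reuses the existing SparkContext
        val spark = SparkSession.builder().config(conf).getOrCreate()

        // Spark Streaming entry point with a 5-second batch interval
        val ssc = new StreamingContext(sc, Seconds(5))

        sc.stop()
    }
}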

2.2: Master URL Configuration

The Master URL tells Spark how the job should run.

Local mode (local): the Spark job runs locally (both the Driver and the Executors run on the local machine)
    local[M]    : allocate M worker threads to the current Spark job
    local[M, N] : allocate M worker threads; a failed task is retried at most N times
1: Spark's own cluster (standalone)
    Format:    spark://<masterIp>:<masterPort>
        e.g.   spark://bigdata01:7077
    HA format: spark://<masterIp1>:<masterPort1>,<masterIp2>:<masterPort2>,...
        e.g.   spark://bigdata01:7077,bigdata02:7077
2: YARN (the mainstream choice in China)
    cluster: the SparkContext/Driver is created inside the YARN cluster
    client:  the SparkContext/Driver is created locally (i.e. on the machine that submits the job)
3: Mesos (more common in Europe and the US)

A sketch of setting the master URL in code follows this list.
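As a rough sketch (the host names come from the examples above; for YARN the master URL is normally supplied via spark-submit rather than hard-coded), the master URL is set on SparkConf:

import org.apache.spark.SparkConf

object MasterUrlDemo {
    // local mode: one worker thread per available core
    val localConf = new SparkConf()
        .setAppName("MasterUrlDemo")
        .setMaster("local[*]")

    // standalone cluster, HA format with two masters
    val standaloneConf = new SparkConf()
        .setAppName("MasterUrlDemo")
        .setMaster("spark://bigdata01:7077,bigdata02:7077")
}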

3: Spark Core: WordCount

First place the cluster's hdfs-site.xml and core-site.xml under the resources directory.
Raw data: data1.txt

1 Java bigdata
2 Java bigdata
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Recommended way to write this kind of program: work backwards and fill in the blanks
  * (sketch the overall logic first, then flesh out each step).
  * Programming steps:
  *     1. Build a SparkContext
  *         SparkContext depends on a SparkConf
  *             the conf must specify the master URL
  *             the conf must specify the application name
  *     2. Load external data to form an RDD
  *     3. Transform the RDD according to the business logic
  *     4. Submit the job
  *     5. Release resources
  */
object SparkScalaWordCount {
    def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.spark-project").setLevel(Level.WARN)
     
        val conf = new SparkConf()
        // the conf must specify the application name and master URL
        conf.setAppName(s"${SparkScalaWordCount.getClass.getSimpleName}")
        conf.setMaster("local[*]")
        val sc = new SparkContext(conf)
        // load a local file
        val lines: RDD[String] = sc.textFile("file:///C:/data1.txt")
        // or load from HDFS
//        val lines: RDD[String] = sc.textFile("hdfs://ns1/data/spark/streaming/test/hello.txt")
        println("partition: " + lines.getNumPartitions)
        val words: RDD[String] = lines.flatMap(line => line.split("\\s+"))
        val pairs: RDD[(String, Int)] = words.map(word => (word, 1))
        val ret: RDD[(String, Int)] = pairs.reduceByKey((v1, v2) => v1 + v2)
        ret.foreach { case (word, count) => println(word + "--->" + count) }
        sc.stop()
    }
}

Result:

Java--->2
1--->1
2--->1
bigdata--->2
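To keep the output instead of printing it, a minimal sketch (the output path is a placeholder) would replace the foreach call in the code above with:

// persist the (word, count) pairs as text files
ret.map { case (word, count) => s"$word\t$count" }
    .saveAsTextFile("file:///C:/wordcount-output")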

4: Spark SQL + Hive

4.1: Basic Introduction

Prerequisite: the relevant configuration files have been prepared (for Hive access this typically includes hive-site.xml under resources).

SQL statement: a join across two tables
select i.name,  b.age, b.married,  b.children, i.height from teacher_info i left join teacher_basic b on i.name = b.name

Code

import org.apache.spark.sql.SparkSession

object SparkSQLLoadAndSave {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
            .appName("SparkSQLLoadAndSave")
            .enableHiveSupport() // enables Hive support (Hive metastore access and HiveQL)
            .getOrCreate()       // the master URL is expected to be supplied via spark-submit
        // create a database
        spark.sql("create database db_1810")

        val createInfoSQL =
            """
              |create table `db_1810`.`teacher_info` (name string, height int) row format delimited
              |fields terminated by ','
            """.stripMargin
        spark.sql(createInfoSQL)

        val createBasicSQL =
            """
              |create table `db_1810`.`teacher_basic` (name string, age int, married boolean, children int) row format delimited
              |fields terminated by ','
            """.stripMargin
        spark.sql(createBasicSQL)
        // load the data into both tables
        val loadInfoSQL = "load data inpath 'hdfs://ns1/input/spark/teacher_info.txt' into table `db_1810`.`teacher_info`"
        val loadBasicSQL = "load data inpath 'hdfs://ns1/input/spark/teacher_basic.txt' into table `db_1810`.`teacher_basic`"
        spark.sql(loadInfoSQL)
        spark.sql(loadBasicSQL)
        // run the join query
        val sql =
            """
              |select i.name, b.age, b.married, b.children, i.height
              |from `db_1810`.`teacher_info` i left join `db_1810`.`teacher_basic` b on i.name = b.name
            """.stripMargin
        val joinedDF = spark.sql(sql)

        // persist the result to the Hive table teacher
        joinedDF.write.saveAsTable("`db_1810`.`teacher`")
        spark.stop()
    }
}
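A quick way to check the result is a minimal sketch like the following, added just before spark.stop() in the code above (the table name comes from the SQL already shown):

// inspect the joined result written to Hive
spark.sql("select * from `db_1810`.`teacher`").show()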

5: Spark Streaming

Spark Streaming is most often used together with Kafka.
This example reads from Kafka using the direct approach.

import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingWithDirectKafkaOps {
    def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.spark-project").setLevel(Level.WARN)
        if (args == null || args.length < 3) {
            println(
                """Parameter Errors! Usage: <batchInterval> <groupId> <topicList>
                  |batchInterval    :  batch interval of the job
                  |groupId          :  consumer group id
                  |topicList        :  comma-separated list of topics to consume
                """.stripMargin)
            System.exit(-1)
        }
        val Array(batchInterval, groupId, topicList) = args

        val conf = new SparkConf()
                    .setAppName("SparkStreamingWithDirectKafkaOps")
                    .setMaster("local[*]")

        val ssc = new StreamingContext(conf, Seconds(batchInterval.toLong))
        val kafkaParams = Map[String, String](
            "bootstrap.servers" -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
            "group.id" -> groupId,
            // largest : start reading from the latest offsets
            // smallest: start reading from the earliest offsets
            "auto.offset.reset" -> "smallest"
        )
        val topics = topicList.split(",").toSet
        // read the data from Kafka using the direct approach
        val kafkaDStream: InputDStream[(String, String)] = KafkaUtils
            .createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, kafkaParams, topics)

        kafkaDStream.foreachRDD((rdd, bTime) => {
            if (!rdd.isEmpty()) {
                println(s"Time: $bTime")
                rdd.foreach { case (key, value) => println(value) }
                println("Offset ranges:")
                val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
                for (offsetRange <- offsetRanges) {
                    val topic = offsetRange.topic
                    val partition = offsetRange.partition
                    val fromOffset = offsetRange.fromOffset
                    val untilOffset = offsetRange.untilOffset
                    val count = offsetRange.count()
                    println(s"topic:${topic}, partition:${partition}, " +
                        s"fromOffset:${fromOffset}, untilOffset:${untilOffset}, count:${count}")
                }
            }
        })
        ssc.start()
        // keep the application running
        ssc.awaitTermination()
    }
}
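In a real job the message values would normally be transformed rather than just printed. As a rough sketch (not part of the original example; it would be added before ssc.start()), a per-batch word count over the Kafka values could look like this:

// word count over the Kafka message values, printed once per batch
kafkaDStream
    .map { case (_, value) => value }
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .print()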

6: Maven Dependencies

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.2</spark.version>
    <hadoop.version>2.6.4</hadoop.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <!-- scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.39</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <!-- SparkSQL -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL + Hive integration -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- SparkStreaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- scala with jdbc -->
    <dependency>
      <groupId>org.scalikejdbc</groupId>
      <artifactId>scalikejdbc_2.11</artifactId>
      <version>3.2.0</version>
    </dependency>
  </dependencies>