spark实时分析 spark实时处理

转载

IT独行侠客 2023-07-11 17:00:56

文章标签 spark实时分析 spark 构造器 Streaming 文章分类 Spark 大数据

Spark Streaming核心概念
我们所谓的Spark Streaming做数据的实时处理，并不是一个真正的实时处理，是因为并非是来一条数据就处理一条数据。本质上Spark Streaming是将进来的数据流DStreams，按照我们指定的时间间隔，拆成了小批次数据，进行处理。其中每一个批次就是一个RDD。
官网：Spark Streaming - Spark 3.2.0 Documentation (apache.org)
一、StreamingContext
1）一定要配置master和name两个参数。
二、DStreams（Discretized Streams）
1）DStreams理解为一个持续的数据流
2）DStreams是Spark Streaming核心编程的抽象模型
3）DStreams有两个来源：源端的数据输入，是DStreams；源端数据输入后，经过一系列计算后的输出结果，依然是DStreams。代码表示

val result = lines.flatMap(_.split(",")).map((_,1))

spark实时分析 spark实时处理_Streaming

三、Input DStreams and Receiver
1）Input DStreams就是数据源端接收过来的数据流，即从Kafka、Flume、HDFS接收过来的数据就是Input DStreams。代码表示

val lines = ssc.socketTextStream("spark000", 9527)

2）lines就是Input DStreams，即从netcat服务器上接收过来的数据流。
3）每一个Input DStreams都关联一个Receiver（除了文件流）。代码表示

val lines: ReceiverInputDStream[String]= ssc.socketTextStream("spark000", 9527)

4）ReceiverInputDStream是一个抽象类，是接收输入的DStream。需要启动在worker节点上的receiver，并接收外部的输入数据。然后交给spark内存，最后交给引擎做相应处理。
5）ReceiverInputDStream要有一个getReceiver方法.
6）注意1：当在本地运行Spark Streaming时，master的配置一定不能使用local或者local[1]，会发生只有一个receiver接收。所以要设置local[n]，n一定要大于receiver的数量。
7）注意2：当业务逻辑运行在集群之上，cores的数量对应于上述的local[n]的数量。cores的数量一定要大于receiver的数量。否则会发生仅接收数据，并不处理数据。

实战之读取文件系统的数据
0）Spark Streaming本地读文件系统的数据。
1）数据源端有2个，其中一个是basic sources。basic sources中也有2个，socket和file。socket已经ok了，其中ssc.socketTextStream()和ssc.socketStream()底层调用的都是SocketInputDStream，只是提供的API不同。
2）file的话，file必须兼容HDFS的API。使用的话，是通过StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass]。代码表示：

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

3）file是不需要运行一个receiver来接收数据的。所以在开发过程中local可以写1。
4）代码部分，修改了两个地方

1、object NetworkWordCount 修改为 object FileWordCount
2、val lines = ssc.socketTextStream("spark000", 9527)修改为val lines = ssc.textFileStream("file://本地文件地址")

5）Spark Streaming会去监控文件夹，只要有新的数据，就会处理。可以写本地，也可以写hdfs，也可以写正则表达式。但文件格式要相同（文件后缀）。
6）本地启动后，再新建文本并放入。

package com.imooc.bigdata.ss

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/*
 * object FileWordCount作用：完成词频统计分析
 * 数据源：HDFS
 *
 * ss的编程范式
 * 1)main方法
 * 2)找入口点:new StreamingContext().var
 * 3)添加SparkConf的构造器:new SparkConf().var
 * 4)参数1：sparkConf放入new StreamingContext()
 * 5)参数2：Seconds(5)放入new StreamingContext()
 * 6)生成ssc:new StreamingContext(sparkConf,Seconds(5)).var
 * 7)对接网络数据
 *   ssc.socketTextStream("spark000",9527).var
 * 8)开始业务逻辑处理
 *   启动流作业:ssc.start()
 *   输入数据以逗号分隔开:map是给每个单词赋值1,，然后两两相加。
     lines.flatMap(_.split(",")).map((_,1))
     .reduceBykey(_+_).var
 *   结果打印：
 *   终止流作业:ssc.awaitTermination()
 * 9)运行报错，添加val sparkConf = new SparkConf()参数
 */

object FileWordCount {
  // *****第1步
  /*
     对于NetworkWordCount这种Spark Streaming编程来讲，也是通过main方法
     输入main，回车
   */
  def main(args: Array[String]): Unit = {
    // *****第2步
    /*
       和kafka相同，找入口点
       官网：https://spark.apache.org/docs/latest/streaming-programming-guide.html
       要开发Spark Streaming应用程序，入口点就是：拿到一个streamingContext：new StreamingContext()
       看源码：按ctrl，进入StreamingContext.scala

     * 关于StreamingContext.scala的描述
       Main entry point for Spark Streaming functionality. It provides methods used to create DStream：
       [[org.apache.spark.streaming.dstream.DStream]
       那 DStream是什么呢？


     * 目前，鼠标放在StreamingContext()，报错：不能解析构造器，所以这里缺少构造器
       Cannot resolve overloaded constructor `StreamingContext`
       在scala里是有构造器的，主构造器、副主构造器。

     * 以下就是构造器要传的三个参数
     * class StreamingContext private[streaming] (
       _sc: SparkContext,
       _cp: Checkpoint,
       _batchDur: Duration
       )

     * 这个是副主构造器1：传的是sparkContext
     * batchDuration是时间间隔
     * def this(sparkContext: SparkContext, batchDuration: Duration) = {
       this(sparkContext, null, batchDuration)
       }

     * 这个是副主构造器2：传的是SparkConf
     * def this(conf: SparkConf, batchDuration: Duration) = {
       this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
       }
     */

    // *****第3步
    /* 添加SparkConf的构造器
       new SparkConf().var
       然后选择sparkConf。不建议加类型
       当打jar包时，两个参数要注释
     */
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[2]")

    // *****第2步
    /*
      new StreamingContext()
     */

    // *****第4步
    /*
       将第3步中新生成的sparkConf，放入new StreamingContext()括号中。
     */

    // *****第5步
    /*
     * 添加时间间隔Duration(毫秒)，可以看一下源码
     * 使用
     * object Seconds {
       def apply(seconds: Long): Duration = new Duration(seconds * 1000)
       }

     * 并导入org.apache的包,往Seconds()放5
     * 意味着指定间隔5秒为一个批次
     */

    // *****第6步
    /*
       new StreamingContext(sparkConf,Seconds(5)).var
       输入ssc
     */
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // TODO... 对接业务数据
    // *****第7步:先调用start启动
    /*
       Creates an input stream from TCP source hostname:port. Data is received using
       a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
     */
    //val lines = ssc.socketTextStream("spark000", 9527)
    val lines = ssc.textFileStream("file:///C://Users//jieqiong//Desktop//tmp//ss")


    // TODO... 业务逻辑处理
    // *****第9步:输入数据以逗号分隔,并打印结果
    val result = lines.flatMap(_.split(",")).map((_,1))
      .reduceByKey(_+_)
    result.print()
    // *****第8步:先调用start启动\终止
    ssc.start()
    ssc.awaitTermination()
  }
}

常用Transformation操作
0）官网：Spark Streaming - Spark 3.2.0 Documentation (apache.org)
1）Spark Streaming中，基于DStreams的一些Transformation操作
2）作用于DStreams的Transformation操作，和RDD是非常类似的。Transformation中的数据是允许修改的。Transformation支持许多方法的：map、flatMap、filter、glom等。
3）开启dfs、yarn、master、Zookeeper、并开启端口号nc -lk 9527
4）在命令行中输入数据，开始测试： 20221212,test
20221224,pk
20221212,jieqiong

package com.imooc.bigdata.ss

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformApp {

  // *****第1步
  /*
     对于NetworkWordCount这种Spark Streaming编程来讲，也是通过main方法
     输入main，回车
   */
  def main(args: Array[String]): Unit = {
    // *****第2步
    /*
       和kafka相同，找入口点
       官网：https://spark.apache.org/docs/latest/streaming-programming-guide.html
       要开发Spark Streaming应用程序，入口点就是：拿到一个streamingContext：new StreamingContext()
       看源码：按ctrl，进入StreamingContext.scala

     * 关于StreamingContext.scala的描述
       Main entry point for Spark Streaming functionality. It provides methods used to create DStream：
       [[org.apache.spark.streaming.dstream.DStream]
       那 DStream是什么呢？


     * 目前，鼠标放在StreamingContext()，报错：不能解析构造器，所以这里缺少构造器
       Cannot resolve overloaded constructor `StreamingContext`
       在scala里是有构造器的，主构造器、副主构造器。

     * 以下就是构造器要传的三个参数
     * class StreamingContext private[streaming] (
       _sc: SparkContext,
       _cp: Checkpoint,
       _batchDur: Duration
       )

     * 这个是副主构造器1：传的是sparkContext
     * batchDuration是时间间隔
     * def this(sparkContext: SparkContext, batchDuration: Duration) = {
       this(sparkContext, null, batchDuration)
       }

     * 这个是副主构造器2：传的是SparkConf
     * def this(conf: SparkConf, batchDuration: Duration) = {
       this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
       }
     */

    // *****第3步
    /* 添加SparkConf的构造器
       new SparkConf().var
       然后选择sparkConf。不建议加类型
       当打jar包时，两个参数要注释
     */
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[2]")

    // *****第2步
    /*
      new StreamingContext()
     */

    // *****第4步
    /*
       将第3步中新生成的sparkConf，放入new StreamingContext()括号中。
     */

    // *****第5步
    /*
     * 添加时间间隔Duration(毫秒)，可以看一下源码
     * 使用
     * object Seconds {
       def apply(seconds: Long): Duration = new Duration(seconds * 1000)
       }

     * 并导入org.apache的包,往Seconds()放5
     * 意味着指定间隔5秒为一个批次
     */

    // *****第6步
    /*
       new StreamingContext(sparkConf,Seconds(5)).var
       输入ssc
     */
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //这里的编程模型是RDD
    val data = List("pk")
    val dataRDD = ssc.sparkContext.parallelize(data).map(x => (x, true))


    // TODO... 对接业务数据
    // *****第7步:先调用start启动
    /*
       Creates an input stream from TCP source hostname:port. Data is received using
       a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
     */
    /*
    20221212,pk  => (pk,   20221212,pk)
    20221212,test
     */
    val lines = ssc.socketTextStream("spark000", 9527)

    //这里的编程模型是DStream,DStream和RDD做join
    lines.map(x => (x.split(",")(1), x))
      .transform(rdd => {
        rdd.leftOuterJoin(dataRDD)
          .filter(x => {
            x._2._2.getOrElse(false) != true
          }).map(x => x._2._1)
      }).print()
    // *****第8步:先调用start启动\终止
    ssc.start()
    ssc.awaitTermination()
  }
}

实战之带状态的应用程序开发
1）累加已传值，将目前上传值和历史上传值进行拼接：updateStateByKey(func)
2）将结果进行累加输出到控制台，记得开启端口

package com.imooc.bigdata.ss

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/*
 * object NetworkWordCount作用：完成词频统计分析 （带state）
 * 数据源：是基于端口、网络即nc的方式，造数据
 *
 * ss的编程范式
 * 1)main方法
 * 2)找入口点:new StreamingContext().var
 * 3)添加SparkConf的构造器:new SparkConf().var
 * 4)参数1：sparkConf放入new StreamingContext()
 * 5)参数2：Seconds(5)放入new StreamingContext()
 * 6)生成ssc:new StreamingContext(sparkConf,Seconds(5)).var
 * 7)对接网络数据
 *   ssc.socketTextStream("spark000",9527).var
 * 8)开始业务逻辑处理
 *   启动流作业:ssc.start()
 *   输入数据以逗号分隔开:map是给每个单词赋值1,，然后两两相加。
     lines.flatMap(_.split(",")).map((_,1))
     .reduceBykey(_+_).var
 *   结果打印：
 *   终止流作业:ssc.awaitTermination()
 * 9)运行报错，添加val sparkConf = new SparkConf()参数
 */

object StateNetworkWordCount {
  // *****第1步
  /*
     对于NetworkWordCount这种Spark Streaming编程来讲，也是通过main方法
     输入main，回车
   */
  def main(args: Array[String]): Unit = {
    // *****第2步
    /*
       和kafka相同，找入口点
       官网：https://spark.apache.org/docs/latest/streaming-programming-guide.html
       要开发Spark Streaming应用程序，入口点就是：拿到一个streamingContext：new StreamingContext()
       看源码：按ctrl，进入StreamingContext.scala

     * 关于StreamingContext.scala的描述
       Main entry point for Spark Streaming functionality. It provides methods used to create DStream：
       [[org.apache.spark.streaming.dstream.DStream]
       那 DStream是什么呢？


     * 目前，鼠标放在StreamingContext()，报错：不能解析构造器，所以这里缺少构造器
       Cannot resolve overloaded constructor `StreamingContext`
       在scala里是有构造器的，主构造器、副主构造器。

     * 以下就是构造器要传的三个参数
     * class StreamingContext private[streaming] (
       _sc: SparkContext,
       _cp: Checkpoint,
       _batchDur: Duration
       )

     * 这个是副主构造器1：传的是sparkContext
     * batchDuration是时间间隔
     * def this(sparkContext: SparkContext, batchDuration: Duration) = {
       this(sparkContext, null, batchDuration)
       }

     * 这个是副主构造器2：传的是SparkConf
     * def this(conf: SparkConf, batchDuration: Duration) = {
       this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
       }
     */

    // *****第3步
    /* 添加SparkConf的构造器
       new SparkConf().var
       然后选择sparkConf。不建议加类型
       当打jar包时，两个参数要注释
     */
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[2]")

    // *****第2步
    /*
      new StreamingContext()
     */

    // *****第4步
    /*
       将第3步中新生成的sparkConf，放入new StreamingContext()括号中。
     */

    // *****第5步
    /*
     * 添加时间间隔Duration(毫秒)，可以看一下源码
     * 使用
     * object Seconds {
       def apply(seconds: Long): Duration = new Duration(seconds * 1000)
       }

     * 并导入org.apache的包,往Seconds()放5
     * 意味着指定间隔5秒为一个批次
     */

    // *****第6步
    /*
       new StreamingContext(sparkConf,Seconds(5)).var
       输入ssc
     */
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //将历史数据放到checkpoint中
    ssc.checkpoint("log-ss")

    // TODO... 对接业务数据
    // *****第7步:先调用start启动
    /*
       Creates an input stream from TCP source hostname:port. Data is received using
       a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
     */
    val lines = ssc.socketTextStream("spark000", 9527)

    // TODO... 业务逻辑处理
    // *****第9步:输入数据以逗号分隔,并打印结果
    val result = lines.flatMap(_.split(",")).map((_,1))
      .updateStateByKey[Int](updateFunction _)
    //  .reduceByKey(_+_)
    result.print()
    // *****第8步:先调用start启动\终止
    ssc.start()
    ssc.awaitTermination()
  }
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    //使用新值，结合已有的老值，进行fun的操作
    val current = newValues.sum
    val old = runningCount.getOrElse(0)
    Some(current + old)
  }
}

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。