Spark Streaming

1. Overview

http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of Spark Core that enables scalable, high-throughput, fault-tolerant processing of data streams. The data processed by Spark Streaming can come from many sources (such as Kafka, Flume, or TCP sockets). These streams go through complex streaming transformations, and the final results are written to file systems or databases, or displayed in real time on dashboards.


Internally, Spark Streaming splits the incoming stream into batches (micro batches). The Spark engine processes each micro-batch RDD and produces the final stream of results.


Spark Streaming provides a high-level abstraction called a discretized stream, or DStream. A DStream can be built from an external data source or derived from another DStream through transformations (much like working with Spark RDDs).

Under the hood, a DStream is a sequence of RDDs (Seq[RDD]).

2. A First Spark Streaming Application

Count word occurrences in real time.

Add the dependencies

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.4</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>2.4.4</version>
</dependency>

Write the application

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Streaming application: count word occurrences in real time
 */
object StreamingWordCountApplication {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext, the entry point of a Spark Streaming program
    // conf: the configuration object; batchDuration: the micro-batch interval
    val conf = new SparkConf().setMaster("local[*]").setAppName("streaming wordcount")
    val ssc = new StreamingContext(conf,Seconds(5))

    //2. Build the source DStream
    // Use a TCP source to build the DStream and receive the data sent to the socket
    val lines = ssc.socketTextStream("localhost",8888)

    //3. Apply transformations to the source DStream, similar to Spark RDD operations
    lines
      .flatMap(_.split("\\s"))  // DStream ----> DStream
      .map((_,1L))
      .groupByKey()
      .map(t2 => (t2._1,t2._2.size))
      .print()  // output (action) operator

    //4. The streaming application keeps running until it is stopped manually or fails
    ssc.start()

    //5. Wait for the computation to terminate
    ssc.awaitTermination()
  }
}

Install netcat and start it as a server; lines typed into it are sent to the application.

# install online on CentOS
yum install -y nc

# start the netcat server
nc -lk 8888

# type input into the netcat session
nc -lk 8888
Hello Spark
Hello Spark
Hello Spark
Hello Spark
Hello Spark
Hello Spark

Test output

-------------------------------------------
Time: 1580870700000 ms
-------------------------------------------
(Hello,2)
(Spark,2)

-------------------------------------------
Time: 1580870705000 ms
-------------------------------------------
(Hello,2)
(Spark,2)

...

3. How DStreams Work

The DStream is the core abstraction in Spark Streaming. It represents a continuous data stream (essentially a continuous sequence of RDDs), and each RDD in a DStream holds the data of one fixed time interval.


Any operation applied to a DStream is translated into operations on the underlying RDDs.


Core idea: micro-batching; under the hood Spark RDDs are used to process the discretized data stream. A short sketch follows.
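
To make the micro-batch idea concrete, here is a minimal sketch (the object name is illustrative, and it assumes the same local socket source as the word-count example above) that uses transform to access the RDD behind each micro batch directly; every DStream operation ultimately runs as RDD operations like these, batch by batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("micro batch sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Every 5-second micro batch of this DStream is one RDD
    val lines = ssc.socketTextStream("localhost", 8888)

    // transform exposes the RDD of each batch, so a DStream transformation
    // is really just an RDD transformation applied to every micro batch
    lines
      .transform(rdd => rdd.flatMap(_.split("\\s")).map((_, 1L)).reduceByKey(_ + _))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}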

4. Input Sources and Receivers

An input DStream is a DStream built from the data received from a data source.

There are two ways to build an input DStream:

  • basic sources: usually need no third-party dependencies and can be created directly from the ssc, e.g. file systems and sockets
  • advanced sources: usually require third-party dependencies, e.g. Kafka, Flume and other streaming storage systems

basic source

File system

Use the HDFS API to read the files of a directory on any compatible file system as the DStream source, for example (a fuller sketch follows the notes below):

// Build a DStream from a file system. Note: the path points to a directory, not a single file
val lines = ssc.textFileStream("hdfs://spark:9000/data")

Notes:

  • The path must be a directory, not a specific file
  • The directory path supports wildcards, e.g. hdfs://spark:9000/data*
  • All files must share the same format; plain text is recommended
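
As a minimal end-to-end sketch of the file-based source (the object name is illustrative and it reuses the example HDFS path above): textFileStream only picks up files that appear in the directory after the application starts, so in practice files are moved into the directory once they are fully written.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("file stream sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Monitor the directory; only files added after startup are processed
    val lines = ssc.textFileStream("hdfs://spark:9000/data")

    lines.flatMap(_.split("\\s")).map((_, 1L)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
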
TCP socket
val lines = ssc.socketTextStream("localhost",8888)
RDD Queue

RDD queue: several RDDs can be placed in a Queue and used to build a DStream.

// Note: ssc wraps a SparkContext that can be used directly; there is no need to create one manually
val rdd1 = ssc.sparkContext.makeRDD(List("Hello Spark","Hello Kafka"))
val rdd2 = ssc.sparkContext.makeRDD(List("Hello Scala","Hello Hadoop"))

// Wrap the RDDs in a Queue and create a DStream from it
val queue = scala.collection.mutable.Queue(rdd1,rdd2)
val lines = ssc.queueStream(queue)
Note:

Do not use local[1] as the master URL in a Spark Streaming application. That leaves a single thread to run the receiver and no threads to process the received data. The number of threads must be at least 2, e.g. local[2] or local[*]; a minimal sketch follows.
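
As a minimal illustration of this requirement (the application name below is only a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two threads: one runs the receiver, the rest process the received data
val conf = new SparkConf()
  .setMaster("local[2]") // or "local[*]"
  .setAppName("receiver demo") // placeholder name
val ssc = new StreamingContext(conf, Seconds(5))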

advanced source

Kafka (the most important)
Add the dependency
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
   <version>2.4.4</version>
</dependency>
Write the application
package source

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaSource {
  def main(args: Array[String]): Unit = {
    //1. Initialize the StreamingContext
    val conf = new SparkConf().setAppName("kafka wordcount").setMaster("local[*]")
    val ssc = new StreamingContext(conf,Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    //2. Build the Kafka configuration map
    val kafkaParams = Map[String,Object](
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,classOf[StringDeserializer]),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,classOf[StringDeserializer]),
      (ConsumerConfig.GROUP_ID_CONFIG,"g1")
    )

    //3. Prepare an Array with the topics to subscribe to
    val arr = Array("spark")

    //4. Create the DStream with the KafkaUtils helper
    val lines = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent, // location strategy (an optimization)
      ConsumerStrategies.Subscribe[String,String](arr,kafkaParams)
    )

    //5. Process the data
    lines
      // Kafka record ---> value
      .map(record => record.value())
      .flatMap(_.split("\\s"))
      .map((_,1))
      .groupByKey()
      .map(t2 => (t2._1,t2._2.size))
      .print()

    //6. Start the streaming application
    ssc.start()

    //7. Wait for the application to terminate
    ssc.awaitTermination()
  }
}
Start the Kafka service and a console producer
# start ZooKeeper
bin/zkServer.sh start conf/zoo.cfg
# start Kafka
bin/kafka-server-start.sh -daemon config/server.properties
# create the spark topic
bin/kafka-topics.sh --create --topic spark --zookeeper spark:2181 --partitions 1 --replication-factor 1
Created topic "spark".
# start a console producer for the spark topic
bin/kafka-console-producer.sh --topic spark --broker-list spark:9092
>
Flume
Flume configuration file
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 9999

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Add the dependency
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume_2.11</artifactId>
  <version>2.4.4</version>
</dependency>
Write the application
package source

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumeSource {
  def main(args: Array[String]): Unit = {
    //1. Initialize the StreamingContext
    val conf = new SparkConf().setAppName("flume wordcount").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    //2. Create the DStream with the FlumeUtils helper
    val dstream = FlumeUtils
      .createStream(ssc, "localhost", 9999)

    // wrap the event body bytes in a String
    dstream
      .map(event => new String(event.event.getBody.array()))
      .print()
    //3. Start the streaming application
    ssc.start()

    //4. Wait for the application to terminate
    ssc.awaitTermination()
  }
}
Start the Flume agent to collect data
bin/flume-ng agent --conf conf --conf-file conf/simple.conf --name a1 -Dflume.root.logger=INFO,console
Connect with telnet and send some test data
telnet localhost 44444
Hello Spark
OK
Hello Spark
OK
Custom Sources

http://spark.apache.org/docs/latest/streaming-custom-receivers.html
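
The guide above boils down to extending org.apache.spark.streaming.receiver.Receiver and pushing received records into Spark with store(). Below is a minimal sketch loosely following that guide; the class name, host and port are illustrative assumptions:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver that reads lines from a TCP socket
class CustomSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive the data on a background thread so onStart() returns immediately
    new Thread("Custom Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do here: receive() stops once isStopped() becomes true
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the record over to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: val lines = ssc.receiverStream(new CustomSocketReceiver("localhost", 9999))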

5. Transformations in Spark Streaming

A DStream transformation: DStream —> DStream, or in other words Seq[RDD] —> Seq[RDD].

An RDD transformation: RDD —> RDD.

The commonly used transformations are listed below:

  • map(func): Return a new DStream by passing each element of the source DStream through a function func.
  • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
  • filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
  • repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
  • union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
  • count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
  • reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
  • countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
  • reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
  • join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
  • cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, (Seq[V], Seq[W])) tuples.
  • transform(func) (very important): Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
  • updateStateByKey(func): Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

Common transformations in code

package transformation

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("11").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    // Source DStream. Note: the hostname should be the IP address of the machine (VM) running the netcat server
    val lines = ssc.socketTextStream("localhost", 8888)

    //********************************************************************
    // map transformation (plus flatMap, filter, repartition)
    /*
    lines
      .map((_,true))  // map one type to another
      // split the first element of the tuple on "," and flatten
      .flatMap(_._1.split(","))      // expand one element into multiple elements
      .filter(_.equals("Hello"))    // keep only elements matching the predicate
      .repartition(3)  // adjust parallelism; the number of partitions can grow or shrink
      .print()
    */
    //********************************************************************

    //********************************************************************
    // union: merge two DStreams
    /*
    val d1 = ssc.socketTextStream("localhost",7777)
    val d2 = ssc.socketTextStream("localhost",6666)

    // the union requires both DStreams to have the same element type
    val d3 = d1.union(d2)
    d3.print()
    */
    //********************************************************************

    //********************************************************************
    // count: counts the elements in each micro-batch RDD
    /*
    val d1 = ssc.socketTextStream("localhost",7777)
    val d2 = d1.count()
    d2.print()
     */
    //********************************************************************

    //********************************************************************
    // reduce: aggregates the elements of each micro-batch RDD
    /*
    val d1 = ssc.socketTextStream("localhost", 7777)
    // "1" "2" "3" "4" "5"
    d1
      // "1" "2" "3" "4" "5" ---> 1 2 3 4 5
      .map(str => str.toInt)
      // 1+2 =3
      // 3+3 =6
      // 6+4 =10
      // 10+5 = 15
      .reduce(_ + _)
      .print()
     */
    //********************************************************************

    //********************************************************************
    // countByValue: counts identical elements in each micro batch and returns (element, count)
    /*
    val d1 = ssc.socketTextStream("localhost", 7777)
    d1
      .flatMap(_.split(","))
      .countByValue()
      .print()
     */
    //********************************************************************

    //********************************************************************
    // reduceByKey: for a DStream[(k,v)], reduces the values of each key
    /*
    val d1 = ssc.socketTextStream("localhost", 7777)
    // a,a,a,b,c
    d1
      .flatMap(_.split(","))
      .map((_, 1L))
      .reduceByKey((v1, v2) => v1 + v2, numPartitions = 2)
      .print()
    */
    //********************************************************************

    //********************************************************************
    // join: joins DStream[(k,v)] with DStream[(k,w)] and returns DStream[(k,(v,w))]
    /*
    val d1 = ssc.socketTextStream("localhost", 7777).map((_, 1))
    val d2 = ssc.socketTextStream("localhost", 6666).map((_, 1))
    // inner join; left, right and full outer joins are also supported
    d1
      //.join(d2)
      .leftOuterJoin(d2)  // (k, (v, Option))   // Some or None
      .print()
     */
    //********************************************************************

    
    // cogroup: groups DStream[(k,v)] and DStream[(k,w)] by key and returns DStream[(k, (Iterable[v], Iterable[w]))]
    // d1: (userId, name)
    // d2: (userId, order)
    // d1.cogroup(d2)
    // (userId, (Seq(name), Seq(orders)))
    //********************************************************************
    /*
    val d1 = ssc.socketTextStream("localhost", 7777).map((_, 1))
    val d2 = ssc.socketTextStream("localhost", 6666).map((_, 1))
    d1
      .cogroup(d2)
      .print()
    */
    //********************************************************************
    

    //********************************************************************
    // transform: applies an RDD-to-RDD function to every RDD of the DStream and returns a new DStream
    val d1 = ssc.socketTextStream("localhost", 7777)
    // data sampling with sample
    d1
      .transform(rdd => {
        // RDD operations; whenever a requirement cannot be expressed with DStream functions, fall back to RDD operations here
        // rdd.map((_, 1))
        rdd.sample(true,0.3)
      })
      .print()
    //********************************************************************
    // start the application
    ssc.start()

    ssc.awaitTermination()
  }
}

Stateful operations

updateStateByKey(func)

Updates the state for each key; new information can be used to continuously update and maintain arbitrary state data.

package transformation.state

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StateDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("word count")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    val lines = ssc.socketTextStream("localhost", 7777)

    // A checkpoint must be set: local state (in memory) + remote state (the checkpoint)
    // Requires a running HDFS service and the JVM option -DHADOOP_USER_NAME=root
    ssc.checkpoint("hdfs://spark:9000/checkpoint170")

    // stateful computation
    lines
      .flatMap(_.split(","))
      .map((_, 1))
      // the argument is the state-update function
      // values: the values of this key in the current micro batch
      // state: the accumulated state
      .updateStateByKey((values: Seq[Int], state: Option[Int]) => {
        // result of the current micro batch + existing state => latest result
        // values.size = number of occurrences of this key in the current micro batch
        val newState = values.size + state.getOrElse(0)
        // the newly accumulated state
        Some(newState)
      })
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note:

  • The checkpoint directory must be set in advance (checkpoint data is usually stored on a shared distributed file system).
mapWithState

An alternative way to maintain state. Compared with updateStateByKey, which recomputes the full state, mapWithState updates the state incrementally and can be up to 10x faster.

package transformation.state

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StateDemo2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("word count")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    val lines = ssc.socketTextStream("localhost", 7777)

    // A checkpoint must be set: local state (in memory) + remote state (the checkpoint)
    // Requires a running HDFS service and the JVM option -DHADOOP_USER_NAME=root
    ssc.checkpoint("hdfs://spark:9000/checkpoint170_2")

    // stateful computation
    lines
      .flatMap(_.split(","))
      .map((_, 1L))
      // the argument is a StateSpec
      // StateSpec.function(...) ---> StateSpec
      // state: key = word, value = accumulated count
      // key: the word, value: the count in this record, state: the historical state
      .mapWithState(StateSpec.function((key: String, value: Option[Long], state: State[Long]) => {
        var newCount = 0L
        // the key already has historical state
        if (state.exists()) {
          newCount = state.get() + value.get
        } else {
          newCount = 1
        }
        // store newCount back into the state
        state.update(newCount)
        // return the accumulated result as an element of the DStream
        (key, newCount)
      }))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Fault tolerance
Checkpointing

A streaming application must run around the clock (24/7), so it has to tolerate failures that are unrelated to the application logic (JVM crashes, system errors, external factors). To make this possible, Spark Streaming saves enough information to a fault-tolerant storage system (such as HDFS) so that it can recover quickly after a failure. A checkpoint contains two types of data:

  • Metadata checkpointing: information about the streaming application itself
    • Configuration: the configuration of the streaming application (SparkConf)
    • DStream operations: the transformation and output operations that define the application
    • Incomplete batches: micro batches that are queued but not yet processed
  • Data checkpointing: generated RDDs are saved to reliable storage, truncating their dependency (lineage) chain

Checkpoint = application metadata + data. A minimal sketch of enabling it follows.
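
Both kinds of checkpointing are enabled through the same directory. A minimal sketch, assuming the ssc and lines values from the examples above (the HDFS path is just an example):

// The checkpoint directory is used for both metadata and data checkpoints
ssc.checkpoint("hdfs://spark:9000/checkpoint-demo")

// Stateful DStreams are checkpointed automatically; the interval can be tuned per stream.
// A common rule of thumb is 5-10 times the batch interval (here 5s x 5 = 25s).
val stateCounts = lines
  .flatMap(_.split(","))
  .map((_, 1))
  .updateStateByKey((values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0)))
stateCounts.checkpoint(Seconds(25))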

Fault tolerance with checkpoints
package transformation.state

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StateDemo3 {
  def main(args: Array[String]): Unit = {
    // Fault tolerance via the checkpoint
    // Get the StreamingContext back from the checkpoint, or create a new one
    // First argument: the checkpoint path. Second argument: a Function0 that creates the StreamingContext
    val ssc = StreamingContext.getOrCreate(
      "hdfs://spark:9000/checkpoint170_v2",
      () => {
        // the creation function: set up the context exactly as in the previous example
        val conf = new SparkConf().setMaster("local[*]").setAppName("word count")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.sparkContext.setLogLevel("ERROR")
        val lines = ssc.socketTextStream("localhost", 7777)

        // A checkpoint must be set: local state (in memory) + remote state (the checkpoint)
        // Requires a running HDFS service and the JVM option -DHADOOP_USER_NAME=root
        ssc.checkpoint("hdfs://spark:9000/checkpoint170_v2")

        // stateful computation
        lines
          .flatMap(_.split(","))
          .map((_, 1L))
          // the argument is a StateSpec
          // StateSpec.function(...) ---> StateSpec
          // state: key = word, value = accumulated count
          // key: the word, value: the count in this record, state: the historical state
          .mapWithState(StateSpec.function((key: String, value: Option[Long], state: State[Long]) => {
            var newCount = 0L
            // the key already has historical state
            if (state.exists()) {
              newCount = state.get() + value.get
            } else {
              newCount = 1
            }
            // store newCount back into the state
            state.update(newCount)
            // return the accumulated result as an element of the DStream
            (key, newCount)
          }))
          .print()
        ssc
      }
    )

    ssc.sparkContext.setLogLevel("ERROR")

    ssc.start()
    ssc.awaitTermination()
  }
}

6. Window Operations

Spark Streaming also provides window operations, which allow transformation functions to be applied over sliding windows of data. The principle is illustrated below.

[Figure: a sliding window over a DStream, window length 3 time units, sliding interval 2 time units]

The figure depicts a sliding (hopping) window.

Every window operation requires two parameters:

  • window length: the length of the window (in the figure above, 3 time units)
  • sliding interval: the step by which the window slides forward (in the figure above, 2 time units)

Notes:

  • For a tumbling window, simply set the window length equal to the sliding interval (see the sketch below)
  • Session windows are not supported
  • Window operations are stateful computations
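
A minimal sketch of a tumbling window, assuming a lines DStream such as the socket stream used above and a batch interval that divides 10 seconds evenly (the 10-second value is only illustrative):

// Tumbling window: window length equals the sliding interval, so windows never overlap
lines
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(10), Seconds(10))
  .print()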

Window computation

Window-based word count: window size 10s, sliding every 5s.

package window

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountOnWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount on window").setMaster("local[*]")
    // Note: the window size must be a multiple of the batch interval; there is no such thing as half a micro batch
    // e.g. batch 1s, window 5s
    // e.g. batch 5s, window 10s, 15s, 20s, ...
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    // Works even without a checkpoint (the local state lives in memory),
    // but setting one is recommended so the local state has a remote copy
    ssc.checkpoint("hdfs://spark:9000/checkpoint170_v4")

    val lines = ssc.socketTextStream("localhost", 7777)

    lines
      .flatMap(_.split(","))
      .map((_, 1))
      // aggregate the values of each key over the window
      .reduceByKeyAndWindow((v1:Int, v2:Int) => v1 + v2, Seconds(10), Seconds(5))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The six commonly used window operations:

  • window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
  • countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
  • reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.
  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) (more efficient version): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
  • countByValueAndWindow(windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.

Conclusion: when consecutive windows overlap heavily, prefer the second, more efficient variant; when there is little overlap, the first variant is fine.

Note:

  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): the more efficient reduceByKeyAndWindow variant

Why:

Both variants reduce the values of each key and produce exactly the same results; they differ only in efficiency, and the second variant is faster.

  • First variant: full recomputation (the results of the micro batches inside the window are aggregated from scratch)
  • Second variant: incremental computation (previous window's result + new elements entering the current window - expired elements leaving the previous window = current window's result)

How it works:

DStream elements: 1 2 3 4 5 6 7 8 9 10 ...
Micro batches:    b1: 1   b2: 2   b3: 3   ...

Windows (length 5, step 3):
  W1: 1 2 3 4 5
  W2: 4 5 6 7 8
  W3: 7 8 9 10 11

First variant, reduceByKeyAndWindow with full recomputation:
  w1: 1 + 2 + 3 + 4 + 5 = 15                  (5 additions)
  w2: 4 + 5 + 6 + 7 + 8 = 30                  (5 additions)
  w3: ...

Second variant, reduceByKeyAndWindow with incremental computation
(previous window's result + data entering the window - expired data leaving the window):
  w1: 0 + 1 + 2 + 3 + 4 + 5 - 0 = 15          (6 operations)
  w2: 15 + 6 + 7 + 8 - 1 - 2 - 3 = 30         (6 operations)
  w3: 30 + 9 + 10 + 11 - 4 - 5 - 6 = 45

Why is the second variant more efficient? The benefit shows when the window is long and the slide is short, e.g. length 100s, step 1s:
  W1: 1 .. 100
  W2: 2 .. 101
  W3: 3 .. 102

  Full recomputation:
    w1: 1 + 2 + 3 + ... + 100                 (about 100 additions)
    w2: 2 + 3 + 4 + ... + 101                 (about 100 additions)
  Incremental computation:
    w1: 1 + 2 + 3 + ... + 100 = 5050
    w2: 5050 + 101 - 1                        (about 2 operations)

Usage
package window.transformation

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Demonstrates the six window transformations
 */
object TransformationOnWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount on window").setMaster("local[*]")
    // Note: the window size must be a multiple of the batch interval
    // e.g. batch 1s, window 5s
    // e.g. batch 5s, window 10s, 15s, 20s, ...
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    // Works even without a checkpoint (the local state lives in memory),
    // but setting one is recommended so the local state has a remote copy
    ssc.checkpoint("hdfs://spark:9000/checkpoint170_v5")

    val lines = ssc.socketTextStream("localhost", 7777)

    /*
    lines
      .flatMap(_.split(","))
      .map((_, 1))
      // window only defines the window; it does not process the data
      //.window(Seconds(10),Seconds(5))
      // the sliding interval defaults to the batch interval (1s here)
      //.window(Seconds(10))

      // countByWindow counts the number of elements in the window
      //.countByWindow(Seconds(10),Seconds(5))
      .print()
     */

    /*
    lines
        .flatMap(_.split(","))
        .map(strNum => strNum.toInt)
        // reduceByWindow aggregates the elements over the window
        .reduceByWindow((v1:Int,v2:Int) => v1+v2,Seconds(10),Seconds(5))
        .print()
     */

    /*
    lines
      .flatMap(_.split(","))
      .map((_, 1))
      // first argument: add new data entering the window; second argument: subtract expired data leaving it
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(3))
      .print()
     */


    lines
      .flatMap(_.split(","))  // word
      // countByValueAndWindow counts how often each value occurs in the current window and returns (value, count)
      .countByValueAndWindow(Seconds(5), Seconds(3))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

7. Join Operations

DStream-to-DStream join (a minimal sketch is shown below)
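
A minimal sketch of a plain DStream-to-DStream join, assuming a StreamingContext ssc as in the earlier examples; both streams must be pair DStreams from the same context, and the join is performed within each micro batch:

// Two pair DStreams keyed by the received word
val s1 = ssc.socketTextStream("localhost", 8888).map((_, "left"))
val s2 = ssc.socketTextStream("localhost", 7777).map((_, "right"))

// In every micro batch, join the two RDDs on their keys: (k, (v, w))
s1.join(s2).print()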

Joining two DStreams over windows

package join

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object DStreamAndDStreamJoinOnWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("streaming wordcount")

    val ssc = new StreamingContext(conf, Seconds(5))
    // raise the log level so only errors are shown
    ssc.sparkContext.setLogLevel("ERROR")

    // Build the source DStreams
    // Use TCP sources to build the DStream objects and receive the socket data
    val w1 = ssc.socketTextStream("localhost",8888).map((_,1)).window(Seconds(10))
    val w2 = ssc.socketTextStream("localhost",7777).map((_,1)).window(Seconds(15))
    // join the two windowed streams
    w1
      .join(w2)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

DStream-RDD join (important)

Stream-batch join

package join

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Sensitive-word filtering
 */
object DStreamAndRDDJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming wordcount")

    val ssc = new StreamingContext(conf, Seconds(5))
    // raise the log level so only errors are shown
    ssc.sparkContext.setLogLevel("ERROR")

    // the data stream
    val messages = ssc.socketTextStream("localhost", 7777)

    // the batch RDD (the list of sensitive words)
    val words = ssc.sparkContext.makeRDD(List(("sb", 1), ("傻逼", 1)))

    messages
      .map((_, 1))
      .transform(rdd => {
        // join; a left outer join works best here
        rdd.leftOuterJoin(words)
      })
      .map(t2 => {
        var message = t2._1
        if(!t2._2._2.isEmpty){
          message = "**"
        }
        message
      })
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note:

  • Stream-to-stream joins are rarely used in practice; stream-to-batch joins are far more common.

8. Output Operations on DStreams

Output operations write the results of a DStream to an external storage system, such as a database, Redis, HDFS, or HBase.

  • print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. (Python API: this is called pprint() in the Python API.)
  • saveAsTextFiles(prefix, [suffix]): Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”.
  • saveAsObjectFiles(prefix, [suffix]): Save this DStream’s contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”. (Python API: not available in the Python API.)
  • saveAsHadoopFiles(prefix, [suffix]): Save this DStream’s contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”. (Python API: not available in the Python API.)
  • foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

saveAsTextFiles(prefix, [suffix])

Saves the contents of the DStream as text files under the application's working directory.

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  // save the results of the DStream under the application's working directory
  .saveAsTextFiles("result", "xyz")

saveAsObjectFiles(prefix, [suffix])

Saves the contents of the DStream as sequence files of serialized objects under the application's working directory.

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  // save the results of the DStream under the application's working directory
  //.saveAsTextFiles("result", "xyz")
  .saveAsObjectFiles("result", "xyz")

saveAsNewAPIHadoopFiles(prefix, [suffix])

Prefer saveAsNewAPIHadoopFiles (the old-API saveAsHadoopFiles fails here, as noted in the snippet below).

Note: the results are saved on HDFS under /user/<username>/.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  // save the computed results (on HDFS in this case)
  //.saveAsTextFiles("result", "xyz")
  //.saveAsObjectFiles("result", "xyz")
  //.saveAsHadoopFiles("result", "xyz")  // ERROR: the old Hadoop API variant fails here
  .saveAsNewAPIHadoopFiles(
    "result",
    "xyz",
    classOf[Text],
    classOf[LongWritable],
    classOf[TextOutputFormat[Text, LongWritable]],
    conf = hadoopConf) // hadoopConf: an org.apache.hadoop.conf.Configuration defined elsewhere

foreachRDD(func) (important)

Iterates over the RDD behind each batch of the DStream, so the data of every micro batch can be written to any external storage system, such as a database or Redis.

import redis.clients.jedis.Jedis

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  .foreachRDD(rdd => {
    // write the results to Redis
    // preferred pattern: one Jedis connection per partition
    /*
    rdd.foreachPartition(iter => {
      val jedis = new Jedis("localhost", 6379)
      while (iter.hasNext) {
        val tuple = iter.next()
        val word = tuple._1
        val count = tuple._2
        jedis.set(word, count.toString)
      }
      jedis.close()
    })
    */
    // simpler (but less efficient) variant: one connection per record
    rdd.foreach(t2 => {
      val jedis = new Jedis("localhost", 6379)
      jedis.set(t2._1, t2._2.toString)
      jedis.close()
    })
  })