Window Operations

Spark Streaming also provides windowed computations, which let you apply a transformation over a sliding window of data. The figure below illustrates this sliding window.

[Figure: a sliding window over a DStream]

As shown in the figure, the window slides over the source DStream; the source RDDs that fall within the window are combined to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means that every window operation needs two parameters:

 

  1. window length: the duration of the window (3 time units in the figure above).
  2. sliding interval: the interval at which the window operation is performed (2 time units in the figure above).

 

Both parameters must be multiples of the batch interval of the source DStream.

Let's illustrate window operations with an example. Say you want to extend the earlier WordCount example to count the words in the last 30 seconds of data, once every 10 seconds, where each batch of the source DStream covers 10 seconds. To do this, we apply reduceByKey to the (word, 1) pairs of the DStream over the last 30 seconds of data, which is exactly what reduceByKeyAndWindow does (a short sketch follows the table below). Some common window operations are listed below; all of them take the two parameters described above, window length and sliding interval.

[Figure: common window operations]
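To make the 30-second / 10-second example concrete, here is a minimal sketch (the object name and the localhost:9999 text source are assumptions for illustration, not taken from this post). The batch interval is 10 seconds, so both window parameters are multiples of it, as required:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object windowedWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed WordCount sketch").setMaster("local[2]")
    // 10-second batches: the window length (30 s) and sliding interval (10 s) below
    // are both multiples of this batch interval
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)   // assumed text source
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // reduce the (word, 1) pairs over the last 30 seconds of data, every 10 seconds
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}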

 

------------------------- Test data ------------------------------------------------------------

spark
Streaming
better
than
storm
you
need
it
yes
do
it

 

(Every second, one word is drawn at random from this list and sent as input on the socket.) The socket data generator and the test programs can be found via the Baidu Cloud link in the appendix; a rough sketch of such a generator is shown below.
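The generator itself is only distributed through the appendix link; the sketch below is a hypothetical stand-in (the object name and details are mine, not from this post) showing one way to produce that input: accept a single connection on port 9999 and write one random word from the list every second.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

// hypothetical stand-in for the socket data generator from the appendix
object socketWordProducer {
  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "Streaming", "better", "than", "storm",
                    "you", "need", "it", "yes", "do", "it")
    val server = new ServerSocket(9999)
    println("waiting for a Spark Streaming receiver on port 9999 ...")
    val socket = server.accept()
    val out = new PrintWriter(socket.getOutputStream, true)   // autoFlush = true
    while (true) {
      out.println(words(Random.nextInt(words.length)))        // one random word per line
      Thread.sleep(1000)                                      // once per second
    }
  }
}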

----------------------------------------------- window operation -------------------------------------------------

// Input: window length (implicitly, the slide duration is the batch interval used to form the DStream)
// Output: a new DStream containing all the elements of the sliding window
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

// Input: window length and slide duration
// Output: a new DStream containing all the elements of the sliding window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object windowOnStreaming {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- window
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the Window operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.map(x => (x, 1))
    // def window(windowDuration: Duration): DStream[T]
    val getedData1 = data.window(Seconds(6))
    println("windowDuration only : ")
    getedData1.print()
    // same as
    // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
    //val getedData2 = data.window(Seconds(9), Seconds(3))
    //println("Duration and SlideDuration : ")
    //getedData2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

[Figure: console output of the window() example]


 

 

 

-------------------- reduceByKeyAndWindow operation --------------------------------

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * Hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * The slide duration defaults to one batch interval, and the number of partitions to the
 * RDD default (which depends on the number of cores in the Spark cluster).
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * Hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * Hash partitioning is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param numPartitions number of partitions of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
    defaultPartitioner(numPartitions))
}

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param partitioner partitioner used to control the partitioning of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}

/**
 * Return a new DStream by applying incremental reduceByKey over a sliding window: reduceFunc is
 * applied to the new RDDs entering the window, and invReduceFunc to the old RDDs leaving it.
 * Hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param filterFunc optional function to filter key-value pairs (only pairs that satisfy it are retained)
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}

/**
 * Return a new DStream by applying incremental reduceByKey over a sliding window: reduceFunc is
 * applied to the new RDDs entering the window, and invReduceFunc to the old RDDs leaving it.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param partitioner partitioner used to control the partitioning of each RDD in the new DStream
 * @param filterFunc optional function to filter key-value pairs (only pairs that satisfy it are retained)
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {

  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object reduceByWindowOnStreaming {

  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByKeyAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory (required by the invReduceFunc variants)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.map(x => (x, 1))
    // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
    //val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

    // note: this inverse function (a, b) => a + b * 0 simply returns a, i.e. it never
    // subtracts the data leaving the window, so these counts keep growing
    val getedData2 = data.reduceByKeyAndWindow(_ + _,
      (a, b) => a + b * 0,
      Seconds(6), Seconds(2))

    // proper inverse function: subtract the counts of the batches leaving the window
    val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(9), Seconds(6))

    println("reduceByKeyAndWindow : ")
    getedData1.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

[Figure: console output of the reduceByKeyAndWindow example]


The invReduceFunc argument that appears here is a little special and is easy to get wrong. Let's explain it by looking at the internals of the ReducedWindowedDStream class in the source code:

[Figure: excerpt of the ReducedWindowedDStream source]
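The property stated in the doc comment above is that invReduceFunc must undo reduceFunc: `invReduceFunc(reduceFunc(x, y), x) = y`. When the window slides, ReducedWindowedDStream does not recompute the whole window; it takes the previous window's result, "subtracts" the batches that just left the window with invReduceFunc, and "adds" the batches that just entered with reduceFunc (which is also why these incremental variants need ssc.checkpoint to be set, as in the examples here). A minimal plain-Scala sketch of that arithmetic for a single key, with made-up per-batch counts:

object invReduceFuncSketch {
  def main(args: Array[String]): Unit = {
    val reduceFunc    = (a: Int, b: Int) => a + b   // applied to batches entering the window
    val invReduceFunc = (a: Int, b: Int) => a - b   // applied to batches leaving the window

    // hypothetical per-batch counts for one key; window = 3 batches, slide = 1 batch
    val batches = Seq(4, 1, 3, 5)

    // recomputing the new window (batches 1..3) from scratch
    val recomputed = batches.drop(1).reduce(reduceFunc)              // 1 + 3 + 5 = 9

    // incremental update: previous window (batches 0..2), minus the batch that left,
    // plus the batch that entered
    val previous    = batches.take(3).reduce(reduceFunc)             // 4 + 1 + 3 = 8
    val incremental = reduceFunc(invReduceFunc(previous, batches.head), batches.last)  // 8 - 4 + 5 = 9

    assert(recomputed == incremental)
    println(s"recomputed = $recomputed, incremental = $incremental")
  }
}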

 

------------------ reduceByWindow operation ---------------------------


 

// Input: reduceFunc, window length, slide duration
// Output: a DStream whose elements are obtained by reducing the elements of each window,
//         taking them pairwise from left to right with reduceFunc
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}

/**
 * Input: reduceFunc, invReduceFunc, window length, slide duration
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.map((1, _))
      .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
      .map(_._2)
}
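The first overload (without invReduceFunc) is the one to use when the reduce function has no inverse. As a hypothetical sketch that is not part of the original experiments, keeping the longest line seen in each 6-second window:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "longest string" cannot be inverted, so only the non-incremental overload applies;
// no checkpoint directory is needed in this case
object longestLineByWindow {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("reduceByWindow without an inverse").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    val socketStreaming = ssc.socketTextStream("master", 9999)
    // keep the longest line received in the last 6 seconds, updated every 2 seconds
    val longestLine = socketStreaming.reduceByWindow(
      (a, b) => if (a.length >= b.length) a else b, Seconds(6), Seconds(2))
    longestLine.print()

    ssc.start()
    ssc.awaitTermination()
  }
}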

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object reduceByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (required by the invReduceFunc variant)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    //val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))
    // note: the elements here are Strings, so _ + _ concatenates the lines of the window
    // (and string concatenation has no real inverse) rather than counting them
    val data = socketStreaming.reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2))

    println("reduceByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

[Figure: console output of the reduceByWindow example]

 

 

----------------------------------------------- countByWindow operation ---------------------------------

 

/**
 * Input: window length and slide duration; returns the number of elements in each window.
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 */
def countByWindow(
    windowDuration: Duration,
    slideDuration: Duration): DStream[Long] = ssc.withScope {
  // map every element to 1L, then count incrementally with reduceByWindow
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}

 

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (countByWindow uses the incremental reduceByWindow)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))

    println("countByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

-------------------------------- countByValueAndWindow operation --------------------------------

 

/**
 * Input: window length, slide duration, and the number of RDD partitions
 * (defaults to the default parallelism).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration sliding interval of the window (i.e., the interval after which
 *                      the new DStream will generate RDDs); must be a multiple of this
 *                      DStream's batching interval
 * @param numPartitions number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
  : DStream[(T, Long)] = ssc.withScope {
  this.map((_, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L
  )
}
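In other words, countByValueAndWindow maps every element to (element, 1L), runs the incremental reduceByKeyAndWindow over the window, and filters out values whose count has dropped to 0. The following plain-Scala sketch mimics, for a single hypothetical window of the test words (contents made up for illustration), the difference between what countByWindow and countByValueAndWindow return:

object countSemanticsSketch {
  def main(args: Array[String]): Unit = {
    // hypothetical contents of one 6-second window of the test data
    val window = Seq("spark", "it", "yes", "it", "storm")

    // countByWindow produces a single Long per window: the total number of elements
    val totalCount = window.size.toLong                                    // 5

    // countByValueAndWindow produces (value, count) pairs; zero counts are filtered out
    val perValueCounts = window.map((_, 1L))
      .groupBy(_._1)
      .map { case (value, ones) => (value, ones.map(_._2).sum) }           // e.g. ("it", 2)

    println(s"countByWindow          -> $totalCount")
    println(s"countByValueAndWindow  -> $perValueCounts")
  }
}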

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByValueAndWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByValueAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (countByValueAndWindow uses the incremental reduceByKeyAndWindow)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))

    println("countByValueAndWindow: count the occurrences of each value")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

 

[Figure: console output of the countByValueAndWindow example]


 

Appendix

Link: http://pan.baidu.com/s/1slkqwBb   Password: d92r