Window Operations

Spark Streaming also provides windowed computations, which let you apply a transformation over a sliding window of data. The figure below illustrates this sliding window.

[Figure: a sliding window over a DStream]

As shown in the figure, the window slides over the source DStream; the source RDDs that fall within the window are combined to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means that every window operation needs two parameters:

 

  1. window length: the duration of the window (3 time units in the figure above).
  2. sliding interval: the interval at which the window operation is performed (2 time units in the figure above).

 

Both parameters must be multiples of the batch interval of the source DStream.

Let's illustrate window operations with an example. Say you want to extend the earlier WordCount example to count the words in the last 30 seconds of data, once every 10 seconds, where each batch of the source DStream covers 10 seconds. To do this, we apply reduceByKey to the (word, 1) pairs of the DStream over the last 30 seconds of data, which is exactly what reduceByKeyAndWindow does (a short sketch follows the table below). Some common window operations are listed below; all of them take the two parameters described above, window length and sliding interval.

[Figure: common window operations]
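To make the 30-second / 10-second example concrete, here is a minimal sketch (the object name and the localhost:9999 text source are assumptions for illustration, not taken from this post). The batch interval is 10 seconds, so both window parameters are multiples of it, as required:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object windowedWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed WordCount sketch").setMaster("local[2]")
    // 10-second batches: the window length (30 s) and sliding interval (10 s) below
    // are both multiples of this batch interval
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)   // assumed text source
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // reduce the (word, 1) pairs over the last 30 seconds of data, every 10 seconds
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}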

 

------------------------- Test data ------------------------------------------------------------

spark
Streaming
better
than
storm
you
need
it
yes
do
it

 

(Every second, one word is drawn at random from this list and sent as input on the socket.) The socket data generator and the test programs can be found via the Baidu Cloud link in the appendix; a rough sketch of such a generator is shown below.
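The generator itself is only distributed through the appendix link; the sketch below is a hypothetical stand-in (the object name and details are mine, not from this post) showing one way to produce that input: accept a single connection on port 9999 and write one random word from the list every second.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

// hypothetical stand-in for the socket data generator from the appendix
object socketWordProducer {
  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "Streaming", "better", "than", "storm",
                    "you", "need", "it", "yes", "do", "it")
    val server = new ServerSocket(9999)
    println("waiting for a Spark Streaming receiver on port 9999 ...")
    val socket = server.accept()
    val out = new PrintWriter(socket.getOutputStream, true)   // autoFlush = true
    while (true) {
      out.println(words(Random.nextInt(words.length)))        // one random word per line
      Thread.sleep(1000)                                      // once per second
    }
  }
}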

----------------------------------------------- window operation -------------------------------------------------

// Input: window length (implicitly, the slide duration is the batch interval used to form the DStream)
// Output: a new DStream containing all the elements of the sliding window
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

// Input: window length and slide duration
// Output: a new DStream containing all the elements of the sliding window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object windowOnStreaming {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- window
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the Window operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.map(x => (x, 1))
    // def window(windowDuration: Duration): DStream[T]
    val getedData1 = data.window(Seconds(6))
    println("windowDuration only : ")
    getedData1.print()
    // same as
    // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
    //val getedData2 = data.window(Seconds(9), Seconds(3))
    //println("Duration and SlideDuration : ")
    //getedData2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

[Figure: console output of the window() example]


 

 

 

-------------------- reduceByKeyAndWindow operation --------------------------------

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * Hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * The slide duration defaults to one batch interval, and the number of partitions to the
 * RDD default (which depends on the number of cores in the Spark cluster).
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * Hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * Hash partitioning is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param numPartitions number of partitions of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
    defaultPartitioner(numPartitions))
}

/**
 * Return a new DStream by applying reduceByKey over a sliding window. This is similar to
 * DStream.reduceByKey(), except that the function is applied to the data of a sliding window.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param partitioner partitioner used to control the partitioning of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}

/**
 * Return a new DStream by applying incremental reduceByKey over a sliding window: reduceFunc is
 * applied to the new RDDs entering the window, and invReduceFunc to the old RDDs leaving it.
 * Hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param filterFunc optional function to filter key-value pairs (only pairs that satisfy it are retained)
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}

/**
 * Return a new DStream by applying incremental reduceByKey over a sliding window: reduceFunc is
 * applied to the new RDDs entering the window, and invReduceFunc to the old RDDs leaving it.
 * @param reduceFunc associative reduce function (applied left to right)
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param partitioner partitioner used to control the partitioning of each RDD in the new DStream
 * @param filterFunc optional function to filter key-value pairs (only pairs that satisfy it are retained)
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {

  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object reduceByWindowOnStreaming {

  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByKeyAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory (required by the invReduceFunc variants)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.map(x => (x, 1))
    // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
    //val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

    // note: this inverse function (a, b) => a + b * 0 simply returns a, i.e. it never
    // subtracts the data leaving the window, so these counts keep growing
    val getedData2 = data.reduceByKeyAndWindow(_ + _,
      (a, b) => a + b * 0,
      Seconds(6), Seconds(2))

    // proper inverse function: subtract the counts of the batches leaving the window
    val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(9), Seconds(6))

    println("reduceByKeyAndWindow : ")
    getedData1.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

[Figure: console output of the reduceByKeyAndWindow example]


The invReduceFunc argument that appears here is a little special and is easy to get wrong. Let's explain it by looking at the internals of the ReducedWindowedDStream class in the source code:

[Figure: excerpt of the ReducedWindowedDStream source]
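The property stated in the doc comment above is that invReduceFunc must undo reduceFunc: `invReduceFunc(reduceFunc(x, y), x) = y`. When the window slides, ReducedWindowedDStream does not recompute the whole window; it takes the previous window's result, "subtracts" the batches that just left the window with invReduceFunc, and "adds" the batches that just entered with reduceFunc (which is also why these incremental variants need ssc.checkpoint to be set, as in the examples here). A minimal plain-Scala sketch of that arithmetic for a single key, with made-up per-batch counts:

object invReduceFuncSketch {
  def main(args: Array[String]): Unit = {
    val reduceFunc    = (a: Int, b: Int) => a + b   // applied to batches entering the window
    val invReduceFunc = (a: Int, b: Int) => a - b   // applied to batches leaving the window

    // hypothetical per-batch counts for one key; window = 3 batches, slide = 1 batch
    val batches = Seq(4, 1, 3, 5)

    // recomputing the new window (batches 1..3) from scratch
    val recomputed = batches.drop(1).reduce(reduceFunc)              // 1 + 3 + 5 = 9

    // incremental update: previous window (batches 0..2), minus the batch that left,
    // plus the batch that entered
    val previous    = batches.take(3).reduce(reduceFunc)             // 4 + 1 + 3 = 8
    val incremental = reduceFunc(invReduceFunc(previous, batches.head), batches.last)  // 8 - 4 + 5 = 9

    assert(recomputed == incremental)
    println(s"recomputed = $recomputed, incremental = $incremental")
  }
}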

 

------------------ reduceByWindow operation ---------------------------


 

// Input: reduceFunc, window length, slide duration
// Output: a DStream whose elements are obtained by reducing the elements of each window,
//         taking them pairwise from left to right with reduceFunc
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}

/**
 * Input: reduceFunc, invReduceFunc, window length, slide duration
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.map((1, _))
      .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
      .map(_._2)
}
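The first overload (without invReduceFunc) is the one to use when the reduce function has no inverse. As a hypothetical sketch that is not part of the original experiments, keeping the longest line seen in each 6-second window:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "longest string" cannot be inverted, so only the non-incremental overload applies;
// no checkpoint directory is needed in this case
object longestLineByWindow {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("reduceByWindow without an inverse").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    val socketStreaming = ssc.socketTextStream("master", 9999)
    // keep the longest line received in the last 6 seconds, updated every 2 seconds
    val longestLine = socketStreaming.reduceByWindow(
      (a, b) => if (a.length >= b.length) a else b, Seconds(6), Seconds(2))
    longestLine.print()

    ssc.start()
    ssc.awaitTermination()
  }
}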

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object reduceByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (required by the invReduceFunc variant)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    //val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))
    // note: the elements here are Strings, so _ + _ concatenates the lines of the window
    // (and string concatenation has no real inverse) rather than counting them
    val data = socketStreaming.reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2))

    println("reduceByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

[Figure: console output of the reduceByWindow example]

 

 

----------------------------------------------- countByWindow operation ---------------------------------

 

/**
 * Input: window length and slide duration; returns the number of elements in each window.
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 */
def countByWindow(
    windowDuration: Duration,
    slideDuration: Duration): DStream[Long] = ssc.withScope {
  // map every element to 1L, then count incrementally with reduceByWindow
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}

 

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (countByWindow uses the incremental reduceByWindow)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))

    println("countByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

-------------------------------- countByValueAndWindow operation --------------------------------

 

/**
 * Input: window length, slide duration, and the number of RDD partitions
 * (defaults to the default parallelism).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration sliding interval of the window (i.e., the interval after which
 *                      the new DStream will generate RDDs); must be a multiple of this
 *                      DStream's batching interval
 * @param numPartitions number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
  : DStream[(T, Long)] = ssc.withScope {
  this.map((_, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L
  )
}
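In other words, countByValueAndWindow maps every element to (element, 1L), runs the incremental reduceByKeyAndWindow over the window, and filters out values whose count has dropped to 0. The following plain-Scala sketch mimics, for a single hypothetical window of the test words (contents made up for illustration), the difference between what countByWindow and countByValueAndWindow return:

object countSemanticsSketch {
  def main(args: Array[String]): Unit = {
    // hypothetical contents of one 6-second window of the test data
    val window = Seq("spark", "it", "yes", "it", "storm")

    // countByWindow produces a single Long per window: the total number of elements
    val totalCount = window.size.toLong                                    // 5

    // countByValueAndWindow produces (value, count) pairs; zero counts are filtered out
    val perValueCounts = window.map((_, 1L))
      .groupBy(_._1)
      .map { case (value, ones) => (value, ones.map(_._2).sum) }           // e.g. ("it", 2)

    println(s"countByWindow          -> $totalCount")
    println(s"countByValueAndWindow  -> $perValueCounts")
  }
}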

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByValueAndWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByValueAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (countByValueAndWindow uses the incremental reduceByKeyAndWindow)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))

    println("countByValueAndWindow: count the occurrences of each value")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

 

[Figure: console output of the countByValueAndWindow example]


 

Appendix

Link: http://pan.baidu.com/s/1slkqwBb   Password: d92r