Window Operations (窗口操作)
Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. The figure below illustrates the sliding window.
As the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means that every window operation needs two parameters:
- window length (窗口长度): the duration of the window (3 time units in the figure)
- sliding interval (滑动间隔): the interval at which the window operation is performed (2 time units in the figure)
Both parameters must be multiples of the batch interval of the source DStream.
Let's illustrate window operations with an example. Suppose you want to extend the earlier WordCount example to count the words in the last 30 seconds of data, every 10 seconds, with a 10-second batch interval. To do this, we apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 30 seconds of data, which is exactly what the reduceByKeyAndWindow operation does (a minimal sketch follows). Some common window operations are shown below; all of them take the two parameters described above, window length and sliding interval.
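The sketch below is not the full program from the appendix; it assumes a 10-second batch interval and a socket source on master:9999 (host and port are illustrative) and shows only the windowed WordCount described above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch of the 30s/10s windowed WordCount (hypothetical host, port and app name)
object windowedWordCountSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("windowed WordCount sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))          // batch interval: 10 seconds

    val lines = ssc.socketTextStream("master", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // window length 30s, sliding interval 10s -- both are multiples of the 10s batch interval
    val windowedWordCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}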
-------------------------Experiment data----------------------------------------------------------------------
spark
Streaming
better
than
storm
you
need
it
yes
do
it
(One of these words is randomly picked every second and sent as input on the socket. The socket data simulator and the test programs are available via the Baidu Cloud link in the appendix; a minimal simulator sketch follows.)
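The sketch below is only an illustration of such a data server, assuming it listens on port 9999 and emits one random word from the list above every second; the real simulator is in the appendix link.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

// Minimal sketch of the socket data simulator (assumed port 9999, one word per second)
object socketDataSimulatorSketch {
  def main(args: Array[String]) {
    val words = Array("spark", "Streaming", "better", "than", "storm",
      "you", "need", "it", "yes", "do", "it")
    val server = new ServerSocket(9999)
    println("waiting for a connection on port 9999 ...")
    val socket = server.accept()                        // blocks until Spark Streaming connects
    val out = new PrintWriter(socket.getOutputStream, true)
    while (true) {
      out.println(words(Random.nextInt(words.length)))  // pick one word at random
      Thread.sleep(1000)                                // once per second
    }
  }
}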
-----------------------------------------------window operation-------------------------------------------------------------------------
// Input: window length only (the slide duration implicitly defaults to this DStream's slide duration)
// Output: a new DStream containing all the elements that fall within the sliding window
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

// Input: window length and slide duration
// Output: a new DStream containing all the elements that fall within the sliding window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object windowOnStreaming {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- window
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the window operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.map(x => (x, 1))
    // def window(windowDuration: Duration): DStream[T]
    val getedData1 = data.window(Seconds(6))
    println("windowDuration only : ")
    getedData1.print()
    // same as
    // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
    // both durations must be multiples of the 2s batch interval
    // val getedData2 = data.window(Seconds(8), Seconds(4))
    // println("Duration and SlideDuration : ")
    // getedData2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
--------------------reduceByKeyAndWindow operation--------------------------------
/**
 * Returns a new DStream by applying `reduceByKey` over a sliding window on this DStream.
 * Similar to `DStream.reduceByKey()`, but applied over a sliding window. Hash partitioning
 * is used with Spark's default number of partitions.
 * @param reduceFunc associative reduce function, applied left to right
 * @param windowDuration width of the window
 * The slide duration defaults to one batch interval, and the number of partitions is the
 * RDD default (depends on the cluster's cores).
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}

/**
 * Returns a new DStream by applying `reduceByKey` over a sliding window on this DStream.
 * Similar to `DStream.reduceByKey()`, but applied over a sliding window. Hash partitioning
 * is used with Spark's default number of partitions.
 * @param reduceFunc associative reduce function, applied left to right
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}

/**
 * Returns a new DStream by applying `reduceByKey` over a sliding window on this DStream.
 * Similar to `DStream.reduceByKey()`, but applied over a sliding window. Hash partitioning
 * is used.
 * @param reduceFunc associative reduce function, applied left to right
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param numPartitions number of partitions of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
    defaultPartitioner(numPartitions))
}

/**
 * Returns a new DStream by applying `reduceByKey` over a sliding window on this DStream.
 * Similar to `DStream.reduceByKey()`, but applied over a sliding window.
 * @param reduceFunc associative reduce function, applied left to right
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param partitioner partitioner used to control the partitioning of each RDD
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}

/**
 * Returns a new DStream by applying an incremental `reduceByKey` over a sliding window:
 * reduceFunc is applied to the new values entering the window, while invReduceFunc is
 * applied to the old values leaving it. Hash partitioning is used.
 * @param reduceFunc associative reduce function, applied left to right
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param numPartitions number of partitions of each RDD in the new DStream
 * @param filterFunc optional function to filter out expired key-value pairs
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}

/**
 * Returns a new DStream by applying an incremental `reduceByKey` over a sliding window:
 * reduceFunc is applied to the new values entering the window, while invReduceFunc is
 * applied to the old values leaving it.
 * @param reduceFunc associative reduce function, applied left to right
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 * @param partitioner partitioner used to control the partitioning of each RDD
 * @param filterFunc optional function to filter out expired key-value pairs
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {
  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object reduceByWindowOnStreaming {

  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByKeyAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory (required by the incremental window operations)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.map(x => (x, 1))
    // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
    // val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

    // incremental variant: the inverse of _ + _ is _ - _, applied to the values leaving the window
    val getedData2 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(6), Seconds(2))

    // window and slide durations must be multiples of the 2s batch interval
    val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(6))

    println("reduceByKeyAndWindow : ")
    getedData1.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
The invReduceFunc argument that appears here is special, and it is easy to get wrong. Inside the ReducedWindowedDStream class in the source code, the windowed value is not recomputed from scratch: reduceFunc is applied to the new values entering the window and invReduceFunc is applied to the old values leaving it, so invReduceFunc must truly be the inverse of reduceFunc (for a sum, subtraction). An incorrect inverse still compiles and runs, but silently produces wrong results.
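The sketch below illustrates that contract with plain Ints outside of Spark (all names are illustrative): the new window value is obtained by removing the batch that slides out and adding the batch that slides in.

// Illustrative sketch of the invReduceFunc contract (not Spark code)
object invReduceFuncSketch {
  def main(args: Array[String]) {
    val reduceFunc: (Int, Int) => Int = _ + _
    val invReduceFunc: (Int, Int) => Int = _ - _     // the true inverse of addition

    val oldWindowValue = 3 + 5 + 7                   // per-key counts of the batches in the old window
    val leaving = 3                                  // batch that slides out of the window
    val entering = 4                                 // batch that slides into the window

    // incremental update, as ReducedWindowedDStream does for each key:
    val newWindowValue = reduceFunc(invReduceFunc(oldWindowValue, leaving), entering)
    println(newWindowValue)                          // 16, i.e. 5 + 7 + 4

    // a wrong "inverse" such as (a, b) => a + b * 0 never subtracts the old batches,
    // so the windowed counts keep growing instead of reflecting only the current window
  }
}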
------------------reduceByWindow operation---------------------------
// Input: reduceFunc, window duration, slide duration
// The reduce function takes two elements at a time, left to right, and folds them into one value
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}

/**
 * Input: reduceFunc, invReduceFunc, window duration, slide duration
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.map((1, _))
      .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
      .map(_._2)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object reduceByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    // concatenate all lines that fall within the 6s window, sliding every 2s
    val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))
    // the incremental variant needs a true inverse of the reduce function; _ + _ is not the
    // inverse of string concatenation, so reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2))
    // would produce wrong results

    println("reduceByWindow: reduce (concatenate) the lines in the window")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
-----------------------------------------------countByWindow operation---------------------------------
/**
 * Input: window duration and slide duration; returns the number of elements in the window.
 * @param windowDuration width of the window
 * @param slideDuration sliding interval of the window
 */
def countByWindow(
    windowDuration: Duration,
    slideDuration: Duration): DStream[Long] = ssc.withScope {
  // map every element to 1L, then count incrementally with reduceByWindow(+, -)
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (countByWindow uses an inverse reduce internally)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))

    println("countByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
--------------------------------countByValueAndWindow operation-------------
/**
 * Input: window duration, slide duration, and number of RDD partitions (defaults to the
 * default parallelism).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration sliding interval of the window (i.e., the interval after which
 *                      the new DStream will generate RDDs); must be a multiple of this
 *                      DStream's batching interval
 * @param numPartitions number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
    : DStream[(T, Long)] = ssc.withScope {
  // count each distinct value incrementally; the filter drops values whose count reaches 0
  this.map((_, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L
  )
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByValueAndWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByValueAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (countByValueAndWindow uses an inverse reduce internally)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))

    println("countByValueAndWindow: count the occurrences of each element")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Appendix
Link: http://pan.baidu.com/s/1slkqwBb  Extraction code: d92r