1. Requirement
2. The requirement: for every window, output the 5 most-visited URLs. Ranking a whole window's results means first collecting them all, i.e. stateful processing. There are two ways to do it (a sketch of the second follows this list):
(1) keyBy the aggregated results (e.g. by windowEnd) and keep your own state inside a ProcessFunction;
(2) follow the keyed window with a windowAll.
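A minimal sketch of approach (2), reusing the PageViewCount type and the aggResult stream from section 5; the TopNAllWindow name and the 5 s windowAll size are illustrative assumptions, not part of the original job:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// collects every PageViewCount landing in the same 5 s bucket (parallelism 1) and ranks them,
// replacing the keyBy(windowEnd) + KeyedProcessFunction combination used in section 5
class TopNAllWindow(topSize: Int) extends ProcessAllWindowFunction[PageViewCount, String, TimeWindow] {
  override def process(context: Context, elements: Iterable[PageViewCount], out: Collector[String]): Unit = {
    val topN = elements.toList.sortBy(-_.count).take(topSize)
    out.collect(topN.zipWithIndex
      .map { case (p, i) => s"No${i + 1}: URL=${p.url} views=${p.count}" }
      .mkString("\n"))
  }
}
// usage: aggResult.timeWindowAll(Time.seconds(5)).process(new TopNAllWindow(5)).print("top N")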
3. The normal case, ignoring out-of-order data (no 1-minute delayed window close, no side output for late data)
Event-time sliding windows: 10 min size, 5 s slide.
The window boundaries are fixed from the moment the job starts, but a window only emits output if it actually receives data: [10:15:45, 10:25:50), [10:15:50, 10:25:55), [10:15:55, 10:26:00), ...
Watermark: 1 s bounded out-of-orderness.
Timer: windowEnd + 1 (ms). A timer only fires once the watermark passes it, and the watermark lags the largest timestamp by 1 s, so the timer effectively fires about 2 s of event time after windowEnd.
aggStream only emits when a window fires.
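A worked check of those numbers for the [10:15:45, 10:25:50) window, using watermark = largest timestamp - 1 s:
the window itself fires when watermark >= 10:25:49.999, i.e. on the 10:25:51 event;
the windowEnd + 1 ms timer fires when watermark >= 10:25:50.001, i.e. on the 10:25:52 event.
This matches the trace below.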
Raw input and the resulting output (the printed dates show January because the SimpleDateFormat pattern in section 5 uses "mm", minutes, where "MM", month, was intended):
Input (falls in [10:15:45, 10:25:50)): 89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
Output: dataStream> ApacheLogEvent(89.101,-,1421461549000,GET,/presedent)
Input: 89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
Output: dataStream> ApacheLogEvent(89.101,-,1421461550000,GET,/presedent)
Input: 89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
Output (the watermark now triggers the [10:15:45, 10:25:50) window):
dataStream> ApacheLogEvent(89.101,-,1421461551000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,1)
Input:
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
Output (the timer fires):
dataStream> ApacheLogEvent(89.101,-,1421461552000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/presedent views=1
====================================
Input: a record at 10:25:56 triggers the [10:15:50, 10:25:55) window, which emits its agg result.
Input: 89.101 - - 17/05/2015:10:25:57 +0000 GET /presedent
Output (the timer fires):
dataStream> ApacheLogEvent(89.101,-,1421461557000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461555000,4)
top N> ====================================
Time: 2015-01-17 10:25:55.0
No1: URL=/presedent views=4   (covers all four records that fall in [10:15:50, 10:25:55))
====================================
4. Handling out-of-order data (with the 1-minute delayed window close and the side output for late data)
The three safeguards against out-of-order data (a condensed view of how they are wired up follows this list):
(1) watermarks;
(2) allowedLateness: the window closes later, and late data arriving within that period still goes through the incremental aggregation;
(3) a side output for data that is too late even for allowedLateness.
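The three safeguards as they attach to the window pipeline, condensed from the full program in section 5 (transforEventStream already carries the watermarks); lateTag is merely a named val for the OutputTag used there:

val lateTag = new OutputTag[ApacheLogEvent]("late")
val aggResult: DataStream[PageViewCount] = transforEventStream
  .filter(_.methord == "GET")
  .keyBy(_.url)
  .timeWindow(Time.minutes(10), Time.seconds(5)) // windows fire based on the watermark, safeguard (1)
  .allowedLateness(Time.minutes(1))              // (2) keep the window open 1 min past the watermark
  .sideOutputLateData(lateTag)                   // (3) anything later than that goes to the side output
  .aggregate(new PageCountAgg, new PageCountWindow)
val lateStream: DataStream[ApacheLogEvent] = aggResult.getSideOutput(lateTag)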
Event-time sliding windows: 10 min size, 5 s slide.
[10:15:45, 10:25:50), [10:15:50, 10:25:55), [10:15:55, 10:26:00), ...
Watermark: 1 s bounded out-of-orderness.
Timer: windowEnd + 1 (ms); as before, it only fires once the watermark passes it, i.e. about 2 s of event time after windowEnd.
allowedLateness: 1 min.
Side output tag: "late".
aggStream only emits when a window fires.
Input:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
These four records behave just as in section 3; the only difference is that each window now stays open an extra minute before finally closing.
Input: 89.101 - - 17/05/2015:10:25:46 +0000 GET /presedent
Output: the earlier inputs already pushed the watermark to 10:25:51, so the [10:15:45, 10:25:50) window re-fires immediately with an updated count; the timer, however, needs the watermark to advance again before it fires.
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,2)
Input: 89.101 - - 17/05/2015:10:25:53 +0000 GET /presedent
Output: the watermark advances, so the timer fires:
dataStream> ApacheLogEvent(89.101,-,1421461553000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/presedent views=2
====================================
Input:
89.101 - - 17/05/2015:10:25:31 +0000 GET /presedent
Output: with 1 min of allowed lateness, every window containing this record re-fires with an updated count; the windows ending at 10:25:35/40/45 emit for the first time only because they had no data until now.
dataStream> ApacheLogEvent(89.101,-,1421461531000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,3)
aggStream> PageViewCount(/presedent,1421461545000,1)
aggStream> PageViewCount(/presedent,1421461540000,1)
aggStream> PageViewCount(/presedent,1421461535000,1)
With the watermark now at 10:25:52, any window ending at 10:24:50 or earlier (more than 1 min behind the watermark) has definitely closed. Yet a record stamped 10:24:49 would still not land in the side output: it also belongs to later windows, such as the one ending at 10:24:55, that are still inside their lateness period, so it is still collected into those. A record goes to the side output only once every window it belongs to has closed; its newest window ends roughly 10 min after the record's own timestamp, so only a record more than window size + lateness (about 11 min here) behind the watermark ends up there.
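A small helper, for illustration only (windowsFor is not part of the job), that mirrors how Flink's SlidingEventTimeWindows assigns a timestamp to windows with offset 0, which makes the "all windows closed" condition easy to check:

// lists the (start, end) of every sliding window containing ts
def windowsFor(ts: Long, size: Long, slide: Long): Seq[(Long, Long)] =
  Iterator.iterate(ts - ts % slide)(_ - slide) // newest window start, then step back by the slide
    .takeWhile(_ > ts - size)
    .map(start => (start, start + size))
    .toSeq

// for this job, windowsFor(ts, 10 * 60 * 1000L, 5 * 1000L) yields 120 windows; the newest one
// ends at (ts - ts % 5000) + 10 min, and the record is routed to the side output only when
// watermark >= that end + allowedLateness, i.e. about 11 min after the record's own timestamp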
Input:
89.101 - - 17/05/2015:10:14:51 +0000 GET /presedent
Output: this record is emitted on the side output.
5. Code
package flinkProject

import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

// event type (one parsed log line)
case class ApacheLogEvent(ip: String, userid: String, timestamp: Long, methord: String, url: String)

// per-window aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object HotNewworkLogFlow {
  def main(args: Array[String]): Unit = {
    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    executionEnvironment.setParallelism(1)
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // watermarks are generated periodically, every 200 ms by default
    // executionEnvironment.getConfig.setAutoWatermarkInterval(500) // change the periodic interval to 500 ms if needed

    val stream2: DataStream[String] = executionEnvironment.socketTextStream("127.0.0.1", 1111)
    val transforStream: DataStream[ApacheLogEvent] = stream2.map(data => {
      val tmpList = data.split(" ")
      // caution: "mm" means minutes, not months ("MM"), so the month defaults to January here;
      // this is why the sample output in sections 3 and 4 prints dates like 2015-01-17
      val simpleDateFormat = new SimpleDateFormat("dd/mm/yy:HH:mm:ss")
      val ts = simpleDateFormat.parse(tmpList(3)).getTime
      ApacheLogEvent(tmpList(0), tmpList(1), ts, tmpList(5), tmpList(6))
    })

    // assign timestamps and watermarks; "punctuated" assigners emit per event, "periodic" ones on a
    // timer (the usual choice; BoundedOutOfOrdernessTimestampExtractor is a periodic assigner)
    val transforEventStream: DataStream[ApacheLogEvent] = transforStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
      override def extractTimestamp(t: ApacheLogEvent) = t.timestamp
    })

    val lateOutputTag = new OutputTag[ApacheLogEvent]("late")
    val aggResult: DataStream[PageViewCount] = transforEventStream
      .filter(_.methord == "GET")
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(lateOutputTag)
      .aggregate(new PageCountAgg, new PageCountWindow)

    transforStream.print("dataStream")
    aggResult.print("aggStream")
    aggResult.getSideOutput(lateOutputTag).print()
    aggResult.keyBy(_.windowEnd).process(new HotProcessFunction(5)).print("top N")
    executionEnvironment.execute("transform")
  }
}

// incremental per-key counter: the accumulator is just the running count
class PageCountAgg extends AggregateFunction[ApacheLogEvent, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1
  override def getResult(accumulator: Long): Long = accumulator
  override def merge(a: Long, b: Long): Long = a + b
}

// wraps the pre-aggregated count together with its key and window end
class PageCountWindow extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.iterator.next()))
  }
}

class HotProcessFunction(topSize: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {
  var itemState: ListState[PageViewCount] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // state descriptor: the state's name and type
    val itemsStateDesc = new ListStateDescriptor[PageViewCount]("itemState-state", classOf[PageViewCount])
    // obtain the state handle
    itemState = getRuntimeContext.getListState(itemsStateDesc)
  }

  override def processElement(i: PageViewCount, context: KeyedProcessFunction[Long, PageViewCount, String]#Context, collector: Collector[String]): Unit = {
    itemState.add(i)
    // timer at windowEnd + 1 ms: fires once the watermark passes the window end
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
  }

  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    val allPageViewCount: ListBuffer[PageViewCount] = ListBuffer.empty
    val iterable = itemState.get().iterator()
    while (iterable.hasNext) {
      allPageViewCount += iterable.next()
    }
    itemState.clear()
    val sortedItems = allPageViewCount.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
    // format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")
    for (i <- sortedItems.indices) {
      val currentItem: PageViewCount = sortedItems(i)
      // e.g. No1: URL=/presedent views=2413
      result.append("No").append(i + 1).append(":")
        .append(" URL=").append(currentItem.url)
        .append(" views=").append(currentItem.count).append("\n")
    }
    result.append("====================================\n\n")
    // throttle the output to simulate a rolling display
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}
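To reproduce the traces above, start the socket source before launching the job and paste the test data from section 6 line by line, e.g. with netcat:

nc -lk 1111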
6. Test data:
Test data without allowedLateness and the side output:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:56 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:57 +0000 GET /presedent
89.101 - - 17/05/2015:10:26:02 +0000 GET /presedent
Test data with allowedLateness and the side output:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:46 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:53 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:31 +0000 GET /presedent
89.101 - - 17/05/2015:10:14:51 +0000 GET /presedent
7. Problem: late data is not re-ranked against the earlier results
With out-of-orderness handled, one problem remains. When late data arrives within the 1-minute lateness period, the aggregation re-fires and emits an updated result, e.g. (url1,3) becomes (url1,4). We want (url1,4) to replace (url1,3) in the ranking and re-sort, so the state must keep all of the window's results and be cleared only once the window can no longer update, i.e. at windowEnd + 1 min.
So we switch to a MapState, keyed by url so that an updated count overwrites the old one, plus an extra timer at windowEnd + 1 min that clears the state.
Late input is now re-ranked together with the earlier results.
Sample output:
dataStream> ApacheLogEvent(89.101,-,1421461549000,GET,/presedent)
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/pre)
dataStream> ApacheLogEvent(89.101,-,1421461551000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,1)
aggStream> PageViewCount(/pre,1421461550000,1)
dataStream> ApacheLogEvent(89.101,-,1421461552000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/pre views=1
No2: URL=/presedent views=1
====================================
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,2)
dataStream> ApacheLogEvent(89.101,-,1421461553000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/presedent views=2
No2: URL=/pre views=1
====================================
The improved code:
package flinkProject

import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util
import java.util.Map

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

// event type (one parsed log line)
case class ApacheLogEvent(ip: String, userid: String, timestamp: Long, methord: String, url: String)

// per-window aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object HotNewworkLogFlow {
  def main(args: Array[String]): Unit = {
    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    executionEnvironment.setParallelism(1)
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // watermarks are generated periodically, every 200 ms by default
    // executionEnvironment.getConfig.setAutoWatermarkInterval(500) // change the periodic interval to 500 ms if needed

    val stream2: DataStream[String] = executionEnvironment.socketTextStream("127.0.0.1", 1111)
    val transforStream: DataStream[ApacheLogEvent] = stream2.map(data => {
      val tmpList = data.split(" ")
      // caution: "mm" means minutes, not months ("MM"), so the month defaults to January here;
      // this is why the sample output above prints dates like 2015-01-17
      val simpleDateFormat = new SimpleDateFormat("dd/mm/yy:HH:mm:ss")
      val ts = simpleDateFormat.parse(tmpList(3)).getTime
      ApacheLogEvent(tmpList(0), tmpList(1), ts, tmpList(5), tmpList(6))
    })

    // assign timestamps and watermarks; "punctuated" assigners emit per event, "periodic" ones on a
    // timer (the usual choice; BoundedOutOfOrdernessTimestampExtractor is a periodic assigner)
    val transforEventStream: DataStream[ApacheLogEvent] = transforStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
      override def extractTimestamp(t: ApacheLogEvent) = t.timestamp
    })

    val lateOutputTag = new OutputTag[ApacheLogEvent]("late")
    val aggResult: DataStream[PageViewCount] = transforEventStream
      .filter(_.methord == "GET")
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(lateOutputTag)
      .aggregate(new PageCountAgg, new PageCountWindow)

    transforStream.print("dataStream")
    aggResult.print("aggStream")
    aggResult.getSideOutput(lateOutputTag).print()
    aggResult.keyBy(_.windowEnd).process(new HotProcessFunction(5)).print("top N")
    executionEnvironment.execute("transform")
  }
}

// incremental per-key counter: the accumulator is just the running count
class PageCountAgg extends AggregateFunction[ApacheLogEvent, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1
  override def getResult(accumulator: Long): Long = accumulator
  override def merge(a: Long, b: Long): Long = a + b
}

// wraps the pre-aggregated count together with its key and window end
class PageCountWindow extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.iterator.next()))
  }
}

class HotProcessFunction(topSize: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {
  var itemState: ListState[PageViewCount] = _
  var hotUrlState: MapState[String, Long] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // state descriptor: the state's name and type
    // val itemsStateDesc = new ListStateDescriptor[PageViewCount]("itemState-state", classOf[PageViewCount])
    val hotUrlStateDesc = new MapStateDescriptor[String, Long]("hot-url-state", classOf[String], classOf[Long])
    // obtain the state handle
    // itemState = getRuntimeContext.getListState(itemsStateDesc)
    hotUrlState = getRuntimeContext.getMapState(hotUrlStateDesc)
  }

  override def processElement(i: PageViewCount, context: KeyedProcessFunction[Long, PageViewCount, String]#Context, collector: Collector[String]): Unit = {
    // itemState.add(i)
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
    // keyed by url: an updated count for the same url overwrites the old entry
    hotUrlState.put(i.url, i.count)
    // cleanup timer: at windowEnd + 1 min the window is fully closed and can no longer
    // produce aggregate updates, so the state can safely be cleared
    context.timerService().registerEventTimeTimer(i.windowEnd + 60000L)
  }

  // timestamp is the firing timer's timestamp; every timer registered above arrives here
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    // val allPageViewCount: ListBuffer[PageViewCount] = ListBuffer.empty
    // val iterable = itemState.get().iterator()
    // while (iterable.hasNext) {
    //   allPageViewCount += iterable.next()
    // }
    // itemState.clear()

    // the cleanup timer fired (the current key is the windowEnd): clear the state and stop
    if (timestamp == ctx.getCurrentKey + 60000L) {
      hotUrlState.clear()
      return
    }
    val allPageViewCount: ListBuffer[(String, Long)] = ListBuffer()
    val iterable: util.Iterator[Map.Entry[String, Long]] = hotUrlState.entries().iterator()
    while (iterable.hasNext) {
      val map: Map.Entry[String, Long] = iterable.next()
      allPageViewCount += ((map.getKey, map.getValue))
    }
    // note: no clear() here; the state must survive until the cleanup timer fires
    val sortedItems = allPageViewCount.sortBy(_._2)(Ordering.Long.reverse).take(topSize)
    // format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")
    for (i <- sortedItems.indices) {
      val currentItem: (String, Long) = sortedItems(i)
      // e.g. No1: URL=/presedent views=2413
      result.append("No").append(i + 1).append(":")
        .append(" URL=").append(currentItem._1)
        .append(" views=").append(currentItem._2).append("\n")
    }
    result.append("====================================\n\n")
    // throttle the output to simulate a rolling display
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}