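Window computation is at the core of stream processing. Windows split a stream into finite-sized "buckets", and computations are then applied to the finite data in each bucket.
In Flink, window computations fall into two broad categories: windows on a keyed stream (KeyedStream) and windows on a non-keyed stream (DataStream). Their code structure is as follows: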
Keyed Windows:
Non-Keyed Windows:
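As a rough outline, the two call chains have the following general shape. This is a sketch modeled on the outline in the Flink documentation, not exact syntax; calls in square brackets are optional.

```
// Keyed Windows
stream
   .keyBy(...)                 // keyed versus non-keyed windows
   .window(...)                // required: window assigner
  [.trigger(...)]              // optional: trigger (else the default trigger)
  [.evictor(...)]              // optional: evictor (else no evictor)
  [.allowedLateness(...)]      // optional: lateness (else zero)
  [.sideOutputLateData(...)]   // optional: output tag for late data
   .reduce/aggregate/fold/apply()   // required: window function

// Non-Keyed Windows
stream
   .windowAll(...)             // required: window assigner
  [.trigger(...)]
  [.evictor(...)]
  [.allowedLateness(...)]
  [.sideOutputLateData(...)]
   .reduce/aggregate/fold/apply()   // required: window function
```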
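Window Lifecycle
A window is created when the first element that belongs to it arrives. Once the time (the watermark) passes the window's end time, the window is considered ready and the WindowFunction can be applied to its elements. Once the time (the watermark) passes the window's end time plus the allowed lateness, the window is removed. Flink guarantees this removal only for time-based windows; a global window is not time-based, so it has no such lifecycle.
From the Flink documentation: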
In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows.
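For example, with an event-time windowing strategy that creates non-overlapping (tumbling) windows every 5 minutes and allows a lateness of 1 minute, Flink creates a new window for the interval 12:00 to 12:05 when the first element with a timestamp in that interval arrives, and removes the window when the watermark passes the 12:06 timestamp.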
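The arithmetic in this example can be sketched in plain Scala, without Flink. The names `WindowLifecycleSketch`, `Window`, `isReady`, `cleanupTime` and `isRemoved` are hypothetical, times are expressed as minutes since midnight, and the sketch ignores the exact boundary handling inside Flink.

```scala
object WindowLifecycleSketch {
  // A time window covering [start, end), in minutes since midnight (assumption).
  final case class Window(start: Long, end: Long)

  // The window is ready to fire once the watermark reaches its end time.
  def isReady(w: Window, watermark: Long): Boolean = watermark >= w.end

  // The window's state is kept until end time plus allowed lateness.
  def cleanupTime(w: Window, allowedLateness: Long): Long = w.end + allowedLateness

  // The window is removed once the watermark reaches the cleanup time.
  def isRemoved(w: Window, allowedLateness: Long, watermark: Long): Boolean =
    watermark >= cleanupTime(w, allowedLateness)

  def main(args: Array[String]): Unit = {
    val w = Window(12 * 60, 12 * 60 + 5) // 12:00 to 12:05
    println(cleanupTime(w, 1))           // minute 726, i.e. 12:06
  }
}
```

With a window of 12:00 to 12:05 and 1 minute of allowed lateness, the window is ready at 12:05 but its contents stay available for late elements until 12:06.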
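Every window has a Trigger and a function attached to it. The function performs the computation on the window's contents, while the Trigger decides when the window is ready, since the function is applied only to ready windows.
In addition, you can specify an Evictor, which can remove elements from the window after the trigger fires, either before or after the function runs.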
Keyed vs Non-Keyed Windows
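- Keyed Windows: the stream is partitioned by key, so at any moment several window tasks can run in parallel, depending on the number of distinct keys.
- Non-Keyed Windows: there is no notion of a key, so at any moment only a single window task is running.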
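The difference between the two can be illustrated in plain Scala, without Flink. In this sketch the hypothetical functions `keyed` and `nonKeyed` stand in for what a summing window function would see for the same window contents.

```scala
object KeyedVsNonKeyedSketch {
  // Keyed: elements are grouped by key and each group is aggregated on its own,
  // so the groups could be processed by different tasks in parallel.
  def keyed(elems: Seq[(String, Int)]): Map[String, Int] =
    elems.groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }

  // Non-keyed: all elements of the window are aggregated together by one task.
  def nonKeyed(elems: Seq[(String, Int)]): Int =
    elems.map(_._2).sum

  def main(args: Array[String]): Unit = {
    val window = Seq(("a", 1), ("b", 1), ("a", 1))
    println(keyed(window))
    println(nonKeyed(window))
  }
}
```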
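Window Assigners
A Window Assigner defines how elements are assigned to windows. You specify it in the window(...) call (keyed streams) or the windowAll() call (non-keyed streams).
A Window Assigner assigns each incoming element to one or more windows. Flink ships predefined assigners: tumbling windows, sliding windows, session windows and global windows. You can also implement a custom assigner by extending the WindowAssigner class. All of these except global windows are time-based and produce a TimeWindow. A time-based window has a start timestamp (inclusive) and an end timestamp (exclusive), which together describe the size of the window.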
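1. Tumbling Windows
A tumbling window assigner assigns each element to a window of a specified size. Tumbling windows have a fixed size and do not overlap. For example, with a tumbling window of 5 minutes, the current window is evaluated and a new window is started every five minutes.
Code snippet: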
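import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
// Count words per key in 5-second tumbling processing-time windows
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .reduce((v1, v2) => (v1._1, v1._2 + v2._2))
  .print()
env.execute("Tumbling Window Stream WordCount")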
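How a timestamp maps to exactly one tumbling window can be sketched in plain Scala, without Flink. The name `windowFor` is hypothetical, and the sketch assumes non-negative timestamps in milliseconds and no window offset.

```scala
object TumblingAssignSketch {
  // Returns the single [start, end) window that contains `timestamp`.
  def windowFor(timestamp: Long, size: Long): (Long, Long) = {
    val start = timestamp - (timestamp % size) // round down to a multiple of size
    (start, start + size)
  }

  def main(args: Array[String]): Unit = {
    // With 5-second windows, an element at t = 7000 ms falls into [5000, 10000).
    println(windowFor(7000L, 5000L))
  }
}
```

Because the start is always a multiple of the window size, consecutive windows touch but never overlap.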
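2. Sliding Windows
A sliding window assigner assigns elements to windows of fixed length. As with tumbling windows, the size is set by a window size parameter. An additional window slide parameter controls how often a new sliding window starts. If the slide is smaller than the window size, sliding windows overlap, and an element is then assigned to multiple windows.
Code snippet: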
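import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkProcessingTimeSlidingWindow {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations: 4-second windows sliding every 2 seconds
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .window(SlidingProcessingTimeWindows.of(Time.seconds(4), Time.seconds(2)))
      .aggregate(new UserDefineAggregateFunction)
      .print()
    // Execute the streaming job
    env.execute("Sliding Window Stream WordCount")
  }
}
// Input, accumulator and result are all (word, count) pairs
class UserDefineAggregateFunction extends AggregateFunction[(String, Int), (String, Int), (String, Int)] {
  override def createAccumulator(): (String, Int) = ("", 0)
  // Add one incoming element to the accumulator
  override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) =
    (value._1, value._2 + accumulator._2)
  override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
  // Combine two partial accumulators
  override def merge(a: (String, Int), b: (String, Int)): (String, Int) =
    (a._1, a._2 + b._2)
}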
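Why one element can land in several sliding windows can be sketched in plain Scala, without Flink. The name `windowsFor` is hypothetical, and the sketch assumes non-negative timestamps in milliseconds and no window offset.

```scala
object SlidingAssignSketch {
  // Returns every [start, end) window of length `size` that contains `timestamp`,
  // where a new window starts every `slide` milliseconds.
  def windowsFor(timestamp: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Start of the most recent window that began at or before the timestamp.
    val lastStart = timestamp - (timestamp % slide)
    // Step back one slide at a time while the window still covers the timestamp.
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > timestamp - size)
      .map(start => (start, start + size))
      .toList
  }

  def main(args: Array[String]): Unit = {
    // 4-second windows sliding every 2 seconds, as in the example above.
    println(windowsFor(5000L, 4000L, 2000L))
  }
}
```

With a size of 4 seconds and a slide of 2 seconds, each element belongs to two windows. When the slide equals the size, the result collapses to a single window, which is the tumbling case.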
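3. Session Windows
A session window assigner groups elements by sessions of activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. Instead, a session window closes when it has received no elements for a certain period of time, that is, when a gap of inactivity occurs.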
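The grouping behaviour can be sketched in plain Scala, without Flink. The name `sessions` is hypothetical. The sketch starts a new session whenever the distance to the previous element reaches the gap; how Flink treats an element that arrives exactly at the gap boundary may differ.

```scala
object SessionGroupSketch {
  // Groups timestamps into sessions: an element joins the current session when it
  // arrives less than `gap` after the previous element, otherwise it opens a new one.
  def sessions(timestamps: Seq[Long], gap: Long): List[List[Long]] =
    timestamps.sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, ts) if ts - current.head < gap =>
        (ts :: current) :: done // extend the current session
      case (acc, ts) =>
        List(ts) :: acc         // inactivity gap reached: start a new session
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit = {
    println(sessions(Seq(1000L, 2000L, 9000L, 9500L, 20000L), 5000L))
  }
}
```

Each resulting group has its own start and end, determined only by the data, which is why session windows have no fixed boundaries.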
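4. Global Windows
A global window assigner assigns all elements with the same key to a single global window. A global window has no natural end, so it is only useful together with a custom trigger. The example below uses a CountTrigger that fires once 4 elements have arrived for a key.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.util.Collector

object FlinkGlobalWindow {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(t => t._1)
      .window(GlobalWindows.create())
      .trigger(CountTrigger.of(4))
      .apply(new UserDefineGlobalWindowFunction)
      .print()
    // Execute the streaming job
    env.execute("Global Window Stream WordCount")
  }
}
class UserDefineGlobalWindowFunction extends WindowFunction[(String, Int), (String, Int), String, GlobalWindow] {
  override def apply(key: String,
                     window: GlobalWindow,
                     input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    // Sum the counts of all elements currently in the window
    val sum = input.map(_._2).sum
    out.collect((key, sum))
  }
}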
Window Functions
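After defining the window assigner, you specify the computation to run on each window. This is the job of the window function, which processes the elements of each window once the system determines that the window is ready. The window function can be a ReduceFunction, AggregateFunction, FoldFunction, ProcessWindowFunction or the legacy WindowFunction. ReduceFunction and AggregateFunction run more efficiently than ProcessWindowFunction because they aggregate incrementally: the system calls them as each element arrives in the window. A ProcessWindowFunction, by contrast, buffers all received elements until the window fires and only then processes them as a batch, but in return it has access to the window's metadata. You can mitigate this cost by combining a ProcessWindowFunction with a ReduceFunction, AggregateFunction or FoldFunction, which gives you both incremental aggregation of the window's elements and the additional window metadata that the ProcessWindowFunction receives.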
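The efficiency difference comes down to how much state a window holds before it fires. The following plain-Scala sketch, without Flink, contrasts the two styles. The classes `IncrementalWindow` and `BufferedWindow` are hypothetical stand-ins, not Flink types.

```scala
import scala.collection.mutable.ArrayBuffer

object IncrementalVsBufferedSketch {
  // Incremental style (in the spirit of ReduceFunction): one running value per window.
  final class IncrementalWindow {
    private var acc: Option[(String, Int)] = None
    def add(v: (String, Int)): Unit =
      acc = Some(acc.fold(v)(a => (a._1, a._2 + v._2)))
    def stateSize: Int = if (acc.isDefined) 1 else 0
    def fire(): Option[(String, Int)] = acc
  }

  // Buffered style (in the spirit of ProcessWindowFunction): keep every element
  // and compute the result only when the window fires.
  final class BufferedWindow {
    private val buf = ArrayBuffer.empty[(String, Int)]
    def add(v: (String, Int)): Unit = { buf += v; () }
    def stateSize: Int = buf.size
    def fire(): Option[(String, Int)] =
      buf.headOption.map(h => (h._1, buf.map(_._2).sum))
  }

  def main(args: Array[String]): Unit = {
    val inc = new IncrementalWindow
    val buffered = new BufferedWindow
    Seq(("a", 1), ("a", 1), ("a", 1)).foreach { e => inc.add(e); buffered.add(e) }
    println(s"incremental state: ${inc.stateSize}, buffered state: ${buffered.stateSize}")
  }
}
```

Both styles produce the same result, but the buffered window's state grows with the number of elements while the incremental window's state stays constant.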
ReduceFunction
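import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

class UserDefineReduceFunction extends ReduceFunction[(String, Int)] {
  // Called incrementally as elements arrive; the println shows each call
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    println("reduce:" + v1 + "\t" + v2)
    (v1._1, v2._2 + v1._2)
  }
}
object FlinkProcessingTimeTumblingWindowWithReduceFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new UserDefineReduceFunction)
      .print()
    // Execute the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}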
AggregateFunction
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

class UserDefineAggregateFunction extends AggregateFunction[(String, Int), (String, Int), (String, Int)] {
  override def createAccumulator(): (String, Int) = ("", 0)
  // Called incrementally for each arriving element
  override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
    println("add:" + value + "\t" + accumulator)
    (value._1, value._2 + accumulator._2)
  }
  override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
  // Combine two partial accumulators
  override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
    println("merge:" + a + "\t" + b)
    (a._1, a._2 + b._2)
  }
}
object FlinkProcessingTimeTumblingWindowWithAggregateFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .aggregate(new UserDefineAggregateFunction)
      .print()
    // Execute the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}
FoldFunction
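import org.apache.flink.api.common.functions.FoldFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

class UserDefineFoldFunction extends FoldFunction[(String, Int), (String, Int)] {
  // Folds each arriving element into the accumulator, starting from the initial value
  override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) = {
    println("fold:" + accumulator + "\t" + value)
    (value._1, accumulator._2 + value._2)
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
// Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .fold(("", 0), new UserDefineFoldFunction)
  .print()
// Execute the streaming job
env.execute("Tumbling Window Stream WordCount")
Note: FoldFunction cannot be used with session windows (or other mergeable windows).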
ProcessWindowFunction
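import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class UserDefineProcessWindowFunction extends ProcessWindowFunction[(String, Int), (String, Int), String, TimeWindow] {
  val sdf = new SimpleDateFormat("HH:mm:ss")
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start = sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    // All elements are buffered, so the sum is computed when the window fires
    val total = elements.map(_._2).sum
    out.collect((key + "\t[" + start + "~" + end + "]", total))
  }
}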
object FlinkProcessingTimeTumblingWindowWithProcessWindowFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(t => t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .process(new UserDefineProcessWindowFunction)
      .print()
    // Execute the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}
ProcessWindowFunction & Reduce/Aggregate/Fold
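import java.text.SimpleDateFormat
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class UserDefineProcessWindowFunction2 extends ProcessWindowFunction[(String, Int), (String, Int), String, TimeWindow] {
  val sdf = new SimpleDateFormat("HH:mm:ss")
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start = sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    // After pre-aggregation, the list holds only the single reduced value
    val list = elements.toList
    println("list:" + list)
    val total = list.map(_._2).sum
    out.collect((key + "\t[" + start + "~" + end + "]", total))
  }
}
class UserDefineReduceFunction2 extends ReduceFunction[(String, Int)] {
  // Incremental pre-aggregation, called as elements arrive
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    println("reduce:" + v1 + "\t" + v2)
    (v1._1, v2._2 + v1._2)
  }
}
object FlinkProcessingTimeTumblingWindowWithReduceFucntionAndProcessWindowFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(t => t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new UserDefineReduceFunction2, new UserDefineProcessWindowFunction2)
      .print()
    // Execute the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}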
Per-window State in ProcessWindowFunction
import java.text.SimpleDateFormat
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class UserDefineProcessWindowFunction3 extends ProcessWindowFunction[(String, Int), (String, Int), String, TimeWindow] {
  val sdf = new SimpleDateFormat("HH:mm:ss")
  var wvsd: ValueStateDescriptor[Int] = _
  var gvsd: ValueStateDescriptor[Int] = _
  override def open(parameters: Configuration): Unit = {
    wvsd = new ValueStateDescriptor[Int]("ws", createTypeInformation[Int])
    gvsd = new ValueStateDescriptor[Int]("gs", createTypeInformation[Int])
  }
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start = sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    val list = elements.toList
    //println("list:" + list)
    val total = list.map(_._2).sum
    // windowState is scoped to this key and window; globalState is scoped to the key only
    val wvs: ValueState[Int] = context.windowState.getState(wvsd)
    val gvs: ValueState[Int] = context.globalState.getState(gvsd)
    wvs.update(wvs.value() + total)
    gvs.update(gvs.value() + total)
    println("Window Count:" + wvs.value() + "\t" + "Global Count:" + gvs.value())
    out.collect((key + "\t[" + start + "~" + end + "]", total))
  }
}
object FlinkProcessingTimeTumblingWindowWithProcessWindowFunctionState {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(t => t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .process(new UserDefineProcessWindowFunction3)
      .print()
    // Execute the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}
WindowFunction (Legacy)
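In some places where a ProcessWindowFunction can be used, you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and lacks some advanced features, such as per-window keyed state.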
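import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class UserDefineWindowFunction extends WindowFunction[(String, Int), (String, Int), String, TimeWindow] {
  override def apply(key: String,
                     window: TimeWindow,
                     input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val start = sdf.format(window.getStart)
    val end = sdf.format(window.getEnd)
    // Sum the counts of all elements in the session
    val sum = input.map(_._2).sum
    out.collect((s"${key}\t${start}~${end}", sum))
  }
}
object FlinkProcessingTimeSessionWindowWithWindowFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // Apply the DataStream transformations: a session closes after 5 seconds of inactivity
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(t => t._1)
      .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
      .apply(new UserDefineWindowFunction)
      .print()
    // Execute the streaming job
    env.execute("Session Window Stream WordCount")
  }
}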