Flink Windows
- Keyed Windows
- Window Lifecycle
- Keyed vs Non-Keyed Windows
- Window Assigners
- Tumbling Windows
- Sliding Windows
- Session Windows
- Global Windows
- Window Functions
- ReduceFunction
- AggregateFunction
- FoldFunction
- ProcessWindowFunction
- ProcessWindowFunction & Reduce/Aggregate/Fold
- Per-window State in ProcessWindowFunction
- Trigger
- DeltaTrigger
- Evictors
- CountEvictor
- DeltaEvictor
- TimeEvictor
- User-Defined Evictor
- EventTime Windows
- Watermarks
- Test Case
- Late Data
- Notes on Max Out-of-Orderness vs. Allowed Lateness
Window computation is at the core of stream processing: windows split an unbounded stream into finite-sized "buckets", and we can run computations over the finite data in each bucket. Flink divides window computation into two broad categories: windows over a KeyedStream and windows over a plain DataStream. The code structure for each is shown below:
Keyed Windows
stream
  .keyBy(...)                    <- keyed versus non-keyed windows
  .window(...)                   <- required: the "window assigner"
  [.trigger(...)]                <- optional: "trigger" (else default trigger); decides when the window fires
  [.evictor(...)]                <- optional: "evictor" (else no evictor); removes elements from the window
  [.allowedLateness(...)]        <- optional: "lateness" (else zero); whether late elements are allowed
  [.sideOutputLateData(...)]     <- optional: "output tag" (else no side output for late data)
  .reduce/aggregate/fold/apply() <- required: the "window function" that computes over the window's elements
  [.getSideOutput(...)]          <- optional: "output tag"; retrieves the late data

Non-Keyed Windows
stream
  .windowAll(...)                <- required: the "window assigner"
  [.trigger(...)]                <- optional: "trigger" (else default trigger); decides when the window fires
  [.evictor(...)]                <- optional: "evictor" (else no evictor); removes elements from the window
  [.allowedLateness(...)]        <- optional: "lateness" (else zero); whether late elements are allowed
  [.sideOutputLateData(...)]     <- optional: "output tag" (else no side output for late data)
  .reduce/aggregate/fold/apply() <- required: the "window function" that computes over the window's elements
  [.getSideOutput(...)]          <- optional: "output tag"; retrieves the late data
Window Lifecycle

A window is created when the first element that belongs to it arrives. Once the time (the watermark) passes the window's end time, the window is considered ready, and the WindowFunction can be applied to its elements. When the current time (watermark) passes the window's end time plus the allowed lateness, the window is removed. Only time-based windows have this lifecycle: Flink guarantees removal only for time-based windows; its global windows are not time-based and therefore have no lifecycle.

For example, with an event-time windowing strategy that creates non-overlapping (tumbling) windows every 5 minutes and an allowed lateness of 1 minute, Flink creates a new window for the interval 12:00 to 12:05 when the first element with a timestamp in that interval arrives, and removes the 12:00 to 12:05 window when the watermark passes 12:06.

Every window has a Trigger and a function (ProcessWindowFunction, ReduceFunction, AggregateFunction, or FoldFunction) attached to it. The function contains the computation to apply to the window's contents, while the Trigger specifies the conditions under which the window is considered ready for that function to run.

Apart from the above, you can also specify an Evictor, which can remove elements from the window after the trigger fires, before and/or after the function runs.
Keyed vs Non-Keyed Windows
- Keyed Windows: at any moment, multiple window tasks can fire, one per key.
- Non-Keyed Windows: there is no key, so at any moment only a single window task executes.

In the keyed case, any attribute of the incoming events can be used as the key. A keyed stream allows the windowed computation to run in parallel across multiple tasks, since each logical keyed stream can be processed independently of the others; all elements with the same key are sent to the same parallel task.

In the non-keyed case, the original stream is not split into multiple logical streams and all window logic is executed by a single task, i.e. with a parallelism of 1.
Window Assigners

The window assigner defines how elements are assigned to windows; it is specified by passing an assigner to window(...) / windowAll(...). The assigner is responsible for distributing incoming elements to one or more windows. Flink ships with several predefined assigners: tumbling windows, sliding windows, session windows, and global windows. You can also implement a custom assigner by extending the WindowAssigner class. All windows except global windows are time-based TimeWindows; a time-based window is described by a start timestamp (inclusive) and an end timestamp (exclusive) that define its extent.

In code, Flink uses TimeWindow for time-based windows. It has methods for querying the start and end timestamps, plus an additional method maxTimestamp() that returns the largest timestamp allowed within a given window.
Tumbling Windows
- The tumbling windows assigner assigns each element to a window of a specified size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window of 5 minutes, the current window is evaluated and a new window is started every five minutes.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .reduce((v1,v2)=>(v1._1,v1._2+v2._2))
  .print()
env.execute("Tumbling Window Stream WordCount")
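As an aside, the bucket arithmetic behind a tumbling assigner can be sketched as a pure function. The formula mirrors the one inside Flink's `TimeWindow.getWindowStartWithOffset`; the class and method names below are purely illustrative:

```java
public class TumblingAssign {
    // Start of the tumbling window that contains `timestamp`,
    // for windows of length `windowSize` aligned at `offset` (all in ms).
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        // 5s windows, no offset: a record at 7 000 ms lands in [5 000, 10 000)
        long start = windowStart(7_000L, 0L, 5_000L);
        System.out.println(start + ".." + (start + 5_000L)); // 5000..10000
    }
}
```

Because the start is derived purely from the timestamp, every parallel task assigns an element to the same bucket without coordination.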
Sliding Windows
- The sliding windows assigner assigns elements to windows of fixed length. As with the tumbling assigner, the window size is configured by the window size parameter; an additional window slide parameter controls how frequently a sliding window is started. Hence, if the slide is smaller than the window size, sliding windows overlap, and in that case elements are assigned to multiple windows.
object FlinkProcessingTimeSlidingWindow {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
      .aggregate(new UserDefineAggregateFunction) // custom aggregation
      .print()
    // run the streaming job
    env.execute("Sliding Window Stream WordCount")
  }
}
// user-defined aggregate function
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),(String,Int)]{
  override def createAccumulator(): (String, Int) = ("",0)
  override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
    (value._1,value._2+accumulator._2)
  }
  override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
  override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
    (a._1,a._2+b._2)
  }
}
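Since an element can belong to several sliding windows, the assignment is worth sketching as a pure function. The loop mirrors the logic of Flink's sliding assigners; the class name is illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingAssign {
    // All window start timestamps (ms) whose [start, start + size) covers `timestamp`.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // start of the most recent window covering the timestamp
        long lastStart = timestamp - (timestamp + slide) % slide;
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // size = 4s, slide = 2s: t = 5s is covered by the windows starting at 4s and 2s
        System.out.println(windowStarts(5_000L, 4_000L, 2_000L)); // [4000, 2000]
    }
}
```

With size 4s and slide 2s every element lands in size/slide = 2 windows, which is why overlapping windows multiply state.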
Session Windows
The session windows assigner groups elements by sessions of activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start and end time. Instead, a session window closes when it has not received elements for a certain period of time, i.e. when a gap of inactivity occurs.
object FlinkProcessingTimeSessionWindow {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
      .apply(new UserDefineWindowFunction)
      .print()
    // run the streaming job
    env.execute("Session Window Stream WordCount")
  }
}
// user-defined window function
class UserDefineWindowFunction extends WindowFunction[(String,Int),(String,Int),String,TimeWindow]{
  override def apply(key: String, window: TimeWindow, input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    var start=sdf.format(window.getStart)
    var end=sdf.format(window.getEnd)
    var sum = input.map(_._2).sum
    out.collect((s"${key}\t${start}~${end}",sum))
  }
}
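The gap-based grouping that session windows perform can be sketched offline: assuming events sorted by timestamp, a new session starts whenever the gap to the previous event reaches the configured value. This is a simplification of Flink's actual mechanism, which creates one proto-window per element and merges overlapping ones:

```java
import java.util.ArrayList;
import java.util.List;

public class SessionGrouping {
    // Groups sorted event timestamps into sessions: a new session starts whenever
    // the gap to the previous event is >= `gap` (all values in ms).
    static List<List<Long>> sessions(long[] sortedTimestamps, long gap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) >= gap) {
                result.add(current);          // gap of inactivity: close the session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }

    public static void main(String[] args) {
        long[] ts = {1_000L, 2_000L, 3_000L, 9_000L, 10_000L};
        // gap = 5s -> two sessions: [1s, 2s, 3s] and [9s, 10s]
        System.out.println(sessions(ts, 5_000L).size()); // 2
    }
}
```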
Global Windows
The global windows assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger; otherwise no computation is ever performed, because the global window has no natural end at which to process the aggregated elements.
object FlinkGlobalWindow {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(GlobalWindows.create())
      .trigger(CountTrigger.of(4)) // fire once 4 elements have arrived
      .apply(new UserDefineGlobalWindowFunction)
      .print()
    // run the streaming job
    env.execute("Global Window Stream WordCount")
  }
}
class UserDefineGlobalWindowFunction extends WindowFunction[(String,Int),(String,Int),String,GlobalWindow]{
  override def apply(key: String, window: GlobalWindow, input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    var sum = input.map(_._2).sum
    out.collect((s"${key}",sum))
  }
}
Window Functions

After defining the window assigner, we need to specify the computation to perform on each window. This is the responsibility of the window function, which processes the elements of each window once the system determines that the window is ready. The window function can be a ReduceFunction, AggregateFunction, FoldFunction, ProcessWindowFunction, or the legacy WindowFunction. ReduceFunction and AggregateFunction are more efficient at runtime than ProcessWindowFunction because they aggregate incrementally: as soon as an element arrives in the window, the system invokes the ReduceFunction or AggregateFunction to fold it in. A ProcessWindowFunction instead buffers all received elements until the window fires and only then computes over them in one batch, but in exchange it has access to the window's metadata. This cost can be mitigated by combining a ProcessWindowFunction with a ReduceFunction, AggregateFunction, or FoldFunction, obtaining both incremental aggregation of the window's elements and the extra window metadata that the ProcessWindowFunction receives.
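The efficiency difference can be sketched with two plain functions: an incremental ReduceFunction-style sum that keeps O(1) state per window, and a buffered ProcessWindowFunction-style sum that retains every element until the window fires. The names and the sum example are illustrative, not Flink API:

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalVsBuffered {
    // Incremental style (ReduceFunction-like): a single running value per window.
    static int incrementalSum(int[] elements) {
        int acc = 0;                      // O(1) state
        for (int e : elements) acc += e;  // folded in once per arriving element
        return acc;
    }

    // Buffered style (ProcessWindowFunction-like): every element is retained,
    // and the computation runs only when the window fires.
    static int bufferedSum(int[] elements) {
        List<Integer> buffer = new ArrayList<>();  // O(n) state
        for (int e : elements) buffer.add(e);
        return buffer.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(incrementalSum(in) + " " + bufferedSum(in)); // 10 10
    }
}
```

Both produce the same result; the difference is only in how much state the window must hold before firing.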
ReduceFunction
object FlinkProcessingTimeTumblingWindowWithReduceFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new UserDefineReduceFunction)
      .print()
    // run the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}
class UserDefineReduceFunction extends ReduceFunction[(String,Int)]{
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    println("reduce:"+v1+"\t"+v2)
    (v1._1,v2._2+v1._2)
  }
}
AggregateFunction
object FlinkProcessingTimeTumblingWindowWithAggregateFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .aggregate(new UserDefineAggregateFunction)
      .print()
    // run the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),(String,Int)]{
  override def createAccumulator(): (String, Int) = ("",0)
  override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
    println("add:"+value+"\t"+accumulator)
    (value._1,value._2+accumulator._2)
  }
  override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
  override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
    println("merge:"+a+"\t"+b)
    (a._1,a._2+b._2)
  }
}
FoldFunction
Note: FoldFunction cannot be used with session windows (or other merging windows).
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
// apply the DataStream transformations
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .fold(("",0),new UserDefineFoldFunction)
  .print()
// run the streaming job
env.execute("Tumbling Window Stream WordCount")
class UserDefineFoldFunction extends FoldFunction[(String,Int),(String,Int)]{
  override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) = {
    println("fold:"+accumulator+"\t"+value)
    (value._1,accumulator._2+value._2)
  }
}
ProcessWindowFunction
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
// apply the DataStream transformations
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(t=>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .process(new UserDefineProcessWindowFunction)
  .print()
// run the streaming job
env.execute("Tumbling Window Stream WordCount")
class UserDefineProcessWindowFunction extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow]{
  val sdf=new SimpleDateFormat("HH:mm:ss")
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start =sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    val total=elements.map(_._2).sum
    out.collect((key+"\t["+start+"~"+end+"]",total))
  }
}
ProcessWindowFunction & Reduce/Aggregate/Fold
object FlinkProcessingTimeTumblingWindowWithReduceFucntionAndProcessWindowFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new UserDefineReduceFunction2,new UserDefineProcessWindowFunction2)
      .print()
    // run the streaming job
    env.execute("Tumbling Window Stream WordCount")
  }
}
class UserDefineProcessWindowFunction2 extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow]{
  val sdf=new SimpleDateFormat("HH:mm:ss")
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start =sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    val list = elements.toList
    println("list:"+list)
    val total=list.map(_._2).sum
    out.collect((key+"\t["+start+"~"+end+"]",total))
  }
}
class UserDefineReduceFunction2 extends ReduceFunction[(String,Int)]{
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    println("reduce:"+v1+"\t"+v2)
    (v1._1,v2._2+v1._2)
  }
}
Per-window State in ProcessWindowFunction
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
// apply the DataStream transformations
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(t=>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .process(new UserDefineProcessWindowFunction3)
  .print()
// run the streaming job
env.execute("Tumbling Window Stream WordCount")
class UserDefineProcessWindowFunction3 extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow]{
  val sdf=new SimpleDateFormat("HH:mm:ss")
  var wvsd:ValueStateDescriptor[Int]=_ // per-window state descriptor
  var gvsd:ValueStateDescriptor[Int]=_ // global (per-key) state descriptor
  override def open(parameters: Configuration): Unit = {
    wvsd=new ValueStateDescriptor[Int]("ws",createTypeInformation[Int])
    gvsd=new ValueStateDescriptor[Int]("gs",createTypeInformation[Int])
  }
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start =sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    val list = elements.toList
    //println("list:"+list)
    val total=list.map(_._2).sum
    var wvs:ValueState[Int]=context.windowState.getState(wvsd)
    var gvs:ValueState[Int]=context.globalState.getState(gvsd)
    wvs.update(wvs.value()+total)
    gvs.update(gvs.value()+total)
    println("Window Count:"+wvs.value()+"\t"+"Global Count:"+gvs.value())
    out.collect((key+"\t["+start+"~"+end+"]",total))
  }
}
Trigger

The Trigger decides when a window is ready; once a window is ready, the WindowFunction can run on it. Every WindowAssigner comes with a default Trigger, and users can define their own Trigger when the default does not meet their needs.

| Window type | Trigger | When it fires |
| --- | --- | --- |
| EventTime (Tumbling/Sliding/Session) | EventTimeTrigger | once the watermark passes the window's end time, the window is considered ready |
| ProcessingTime (Tumbling/Sliding/Session) | ProcessingTimeTrigger | once the processing node's system clock passes the window's end time |
| GlobalWindow | NeverTrigger | never fires |
- The Trigger interface has five methods that allow a trigger to react to different events:
public abstract class Trigger<T, W extends Window> implements Serializable {
    /**
     * Called for every element that falls into the current window.
     * @param element the element that arrived
     * @param timestamp the element's arrival timestamp
     * @param window the window the element belongs to
     * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks
     */
    public abstract TriggerResult onElement(T element, long timestamp, W window,
        TriggerContext ctx) throws Exception;
    /**
     * Processing-time timer callback.
     *
     * @param time the time at which the timer fired
     * @param window the window the timer fired for
     * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks
     */
    public abstract TriggerResult onProcessingTime(long time, W window,
        TriggerContext ctx) throws Exception;
    /**
     * Event-time timer callback.
     *
     * @param time the time at which the timer fired
     * @param window the window the timer fired for
     * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks
     */
    public abstract TriggerResult onEventTime(long time, W window, TriggerContext ctx)
        throws Exception;
    /**
     * Called when several windows are merged into one, e.g. for session windows
     * {@link org.apache.flink.streaming.api.windowing.assigners.WindowAssigner}.
     *
     * @param window the new window resulting from the merge
     * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime)
     *            callbacks and to access state
     */
    public void onMerge(W window, OnMergeContext ctx) throws Exception {
        throw new UnsupportedOperationException("This trigger does not support merging.");
    }
    /**
     * Performs any cleanup needed when the window is removed, e.g. clearing timers or state.
     */
    public abstract void clear(W window, TriggerContext ctx) throws Exception;
}
About the methods above, two things are worth noting:
- 1) The first three methods decide whether the window is ready by returning a TriggerResult:
public enum TriggerResult {
    /**
     * Neither fire the window nor purge its elements.
     */
    CONTINUE(false, false),
    /**
     * Fire the window, then purge its elements.
     */
    FIRE_AND_PURGE(true, true),
    /**
     * Fire the window but keep its elements.
     */
    FIRE(true, false),
    /**
     * Do not fire; discard the window and purge its elements.
     */
    PURGE(false, true);
    private final boolean fire;  // whether to fire the window
    private final boolean purge; // whether to purge the window's elements
    ...
}
- 2) Any of these methods can be used to register processing- or event-time timers for future actions.
Example usage:
object FlinkTumblingWindowWithCountTrigger {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.flatMap(line=>line.split("\\s+"))
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .trigger(new UserDefineCountTrigger(4L))
      // collect the window's contents as a single string
      .apply((window: TimeWindow, input: Iterable[String], out: Collector[String]) =>
        out.collect(input.mkString(" | ")))
      .print()
    // run the streaming job
    env.execute("Count Trigger Window Stream WordCount")
  }
}
class UserDefineCountTrigger(maxCount: Long) extends Trigger[String, TimeWindow] {
  var rsd: ReducingStateDescriptor[Long] = new ReducingStateDescriptor[Long]("rsd",
    new ReduceFunction[Long] {
      override def reduce(value1: Long, value2: Long): Long = value1 + value2
    }, createTypeInformation[Long])
  override def onElement(element: String, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    val state: ReducingState[Long] = ctx.getPartitionedState(rsd)
    state.add(1L)
    if (state.get() >= maxCount) {
      state.clear()
      TriggerResult.FIRE_AND_PURGE
    } else {
      TriggerResult.CONTINUE
    }
  }
  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE
  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE
  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    println("==clear==")
    ctx.getPartitionedState(rsd).clear()
  }
}
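Stripped of Flink's state plumbing, the firing logic of the count trigger above reduces to a counter that signals FIRE_AND_PURGE every maxCount elements and CONTINUE otherwise. This sketch (class name illustrative) makes that behavior testable in isolation:

```java
public class CountFire {
    // Mirrors the trigger above: count elements and signal a fire
    // every `maxCount` elements, continuing otherwise.
    private final long maxCount;
    private long count = 0;

    CountFire(long maxCount) { this.maxCount = maxCount; }

    // Returns true when the window should fire (and resets the counter,
    // i.e. the FIRE_AND_PURGE behavior).
    boolean onElement() {
        count += 1;
        if (count >= maxCount) {
            count = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CountFire trigger = new CountFire(4);
        int fires = 0;
        for (int i = 0; i < 10; i++) if (trigger.onElement()) fires++;
        System.out.println(fires); // fires on the 4th and 8th element -> 2
    }
}
```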
DeltaTrigger
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.windowing.delta.DeltaFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.util.Collector
object DeltaTriggerTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2)
    val text = env.socketTextStream("CentOS",9998)
    text.flatMap(_.split(" "))
      .windowAll(GlobalWindows.create())
      .trigger(new UserDefinedDeltaTrigger(new DeltaFunction[String] {
        override def getDelta(oldDataPoint: String, newDataPoint: String): Double = {
          newDataPoint.length - oldDataPoint.length
        }
      }, 4))
      .process(new UserDefinedDeltaProcessAllWindowFunction)
      .print()
    env.execute()
  }
}
// custom ProcessAllWindowFunction
class UserDefinedDeltaProcessAllWindowFunction extends ProcessAllWindowFunction[String,String,GlobalWindow] {
  override def process(context: Context, elements: Iterable[String], out: Collector[String]): Unit = {
    out.collect(elements.toList.mkString(" | "))
  }
}
// custom Trigger
class UserDefinedDeltaTrigger(deltaFunction: DeltaFunction[String], delta: Double) extends Trigger[String,GlobalWindow] {
  // state descriptor for the last data point seen
  private val vsd = new ValueStateDescriptor[String]("value",createTypeInformation[String])
  override def onElement(element: String, timestamp: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    // fetch the state
    val value: ValueState[String] = ctx.getPartitionedState(vsd)
    if (value.value() == null) {
      // first element: just remember it
      value.update(element)
      TriggerResult.CONTINUE
    } else if (deltaFunction.getDelta(value.value(), element) >= delta) {
      // the delta to the previous data point reached the threshold: fire
      value.clear()
      TriggerResult.FIRE
    } else {
      // threshold not reached: remember the new data point
      value.update(element)
      TriggerResult.CONTINUE
    }
  }
  override def onProcessingTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE
  override def onEventTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE
  override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
    println("====clear====")
    ctx.getPartitionedState(vsd).clear()
  }
}
Evictors
Flink's window model allows specifying an optional Evictor in addition to the WindowAssigner and Trigger, via the evictor(...) method. Evictors can remove elements from a window after the trigger fires, before and/or after the window function is applied.
public interface Evictor<T, W extends Window> extends Serializable {
    /**
     * Called before the windowing function is applied.
     *
     * @param elements all elements currently in the window
     * @param size the current number of elements in the window
     * @param window The {@link Window}
     * @param evictorContext the Evictor's context object
     */
    void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window,
        EvictorContext evictorContext);
    /**
     * Called after the windowing function is applied.
     *
     * @param elements The elements currently in the pane.
     * @param size The current number of elements in the pane.
     * @param window The {@link Window}
     * @param evictorContext The context for the Evictor
     */
    void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window,
        EvictorContext evictorContext);
}
evictBefore() contains the eviction logic to apply before the window function, while evictAfter() contains the logic to apply after it. Elements evicted before the window function is applied will not be processed by it.
- Flink ships with three pre-implemented evictors:
CountEvictor
Keeps a user-specified number of elements in the window and discards the remaining elements from the beginning of the window buffer.
private void evict(Iterable<TimestampedValue<Object>> elements, int size,
EvictorContext ctx) {
if (size <= maxCount) {
return;
} else {
int evictedCount = 0;
for (Iterator<TimestampedValue<Object>> iterator = elements.iterator();
iterator.hasNext();){
iterator.next();
evictedCount++;
if (evictedCount > size - maxCount) {
break;
} else {
iterator.remove();
}
}
}
}
DeltaEvictor
Takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the remaining elements, and removes those whose delta is greater than or equal to the threshold.
private void evict(Iterable<TimestampedValue<T>> elements, int size, EvictorContext ctx) {
    TimestampedValue<T> lastElement = Iterables.getLast(elements);
    for (Iterator<TimestampedValue<T>> iterator = elements.iterator(); iterator.hasNext(); ) {
        TimestampedValue<T> element = iterator.next();
        // remove the element if its delta to the last element reaches the threshold
        if (deltaFunction.getDelta(element.getValue(), lastElement.getValue()) >= this.threshold) {
            iterator.remove();
        }
    }
}
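The eviction rule can be tried in isolation. The sketch below mirrors the loop above but, as a simplifying assumption, uses plain numeric distance as the delta function; elements whose distance to the last element reaches the threshold are dropped:

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaEvict {
    // Mirrors DeltaEvictor: drop every element whose delta to the LAST element
    // is >= threshold. Here the delta function is plain numeric distance.
    static List<Integer> evict(List<Integer> window, double threshold) {
        int last = window.get(window.size() - 1);
        List<Integer> kept = new ArrayList<>();
        for (int e : window) {
            if (Math.abs(last - e) < threshold) kept.add(e);  // survives eviction
        }
        return kept;
    }

    public static void main(String[] args) {
        // threshold 3: only elements within distance 3 of the last element (7) survive
        System.out.println(evict(List.of(1, 5, 6, 7), 3.0)); // [5, 6, 7]
    }
}
```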
TimeEvictor
Takes an interval in milliseconds as an argument. For a given window, it finds the maximum timestamp max_ts among the window's elements and removes all elements with timestamps smaller than max_ts - interval, i.e. it keeps only the most recent interval's worth of data.
private void evict(Iterable<TimestampedValue<Object>> elements, int size, EvictorContext ctx) {
    if (!hasTimestamp(elements)) { // nothing to do without timestamps
        return;
    }
    // find the maximum timestamp in the window
    long currentTime = getMaxTimestamp(elements);
    long evictCutoff = currentTime - windowSize;
    for (Iterator<TimestampedValue<Object>> iterator = elements.iterator(); iterator.hasNext(); ) {
        TimestampedValue<Object> record = iterator.next();
        if (record.getTimestamp() <= evictCutoff) {
            iterator.remove();
        }
    }
}
// checks whether the elements carry timestamps
private boolean hasTimestamp(Iterable<TimestampedValue<Object>> elements) {
    Iterator<TimestampedValue<Object>> it = elements.iterator();
    if (it.hasNext()) {
        return it.next().hasTimestamp();
    }
    return false;
}
// computes the maximum timestamp
private long getMaxTimestamp(Iterable<TimestampedValue<Object>> elements) {
    long currentTime = Long.MIN_VALUE;
    for (Iterator<TimestampedValue<Object>> iterator = elements.iterator(); iterator.hasNext(); ) {
        TimestampedValue<Object> record = iterator.next();
        currentTime = Math.max(currentTime, record.getTimestamp());
    }
    return currentTime;
}
User-Defined Evictor
object FlinkSlidingWindowWithUserDefineEvictor {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("CentOS", 9999)
    // apply the DataStream transformations
    val counts = text.windowAll(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
      .evictor(new UserDefineEvictor(false,"error"))
      .apply(new UserDefineSlidingWindowFunction)
      .print()
    // run the streaming job
    env.execute("Sliding Window Stream WordCount")
  }
}
public class UserDefineEvictor implements Evictor<String, TimeWindow> {
    private Boolean isEvictorAfter = false; // evict after the window function?
    private String excludeContent = null;   // content whose carriers get evicted
    public UserDefineEvictor(Boolean isEvictorAfter, String excludeContent) {
        this.isEvictorAfter = isEvictorAfter;
        this.excludeContent = excludeContent;
    }
    @Override // called before the window function
    public void evictBefore(Iterable<TimestampedValue<String>> elements, int size,
                            TimeWindow window, EvictorContext evictorContext) {
        if (!isEvictorAfter) {
            evict(elements, size, window, evictorContext);
        }
    }
    @Override // called after the window function
    public void evictAfter(Iterable<TimestampedValue<String>> elements, int size,
                           TimeWindow window, EvictorContext evictorContext) {
        if (isEvictorAfter) {
            evict(elements, size, window, evictorContext);
        }
    }
    // removes every element that contains the excluded content
    private void evict(Iterable<TimestampedValue<String>> elements, int size,
                       TimeWindow window, EvictorContext evictorContext) {
        for (Iterator<TimestampedValue<String>> iterator = elements.iterator(); iterator.hasNext(); ) {
            TimestampedValue<String> element = iterator.next();
            System.out.println(element.getValue());
            if (element.getValue().contains(excludeContent)) {
                iterator.remove();
            }
        }
    }
}
class UserDefineSlidingWindowFunction extends AllWindowFunction[String, String, TimeWindow] {
override def apply(window: TimeWindow,
input: Iterable[String],
out: Collector[String]): Unit = {
val sdf = new SimpleDateFormat("HH:mm:ss")
var start = sdf.format(window.getStart)
var end = sdf.format(window.getEnd)
var windowContent = input.toList
println("window:" + start + "\t" + end + " " + windowContent.mkString(" | "))
}
}
EventTime Windows

Flink supports several notions of time in stream processing:
- ProcessingTime: processing time
- EventTime: event time
- IngestionTime: ingestion time
- Unless configured otherwise, Flink defaults to ProcessingTime. IngestionTime is similar to ProcessingTime in that both are generated automatically by the system; the difference is that IngestionTime is assigned at the DataSource while ProcessingTime is assigned at the processing operator. Neither strategy can faithfully capture when an event actually occurred in the stream (consider network transmission delays).
- Flink supports window computation with EventTime semantics and uses the watermark mechanism to measure the progress of event time. Watermarks flow as part of the data stream. A watermark carries a timestamp t, declaring that the stream will contain no further elements with event time t' <= t.

Watermark(t) = max event time seen by the processing node - max allowed out-of-orderness

- Max out-of-orderness: the amount by which the watermark trails the maximum event time seen. Since a window only fires once the watermark passes its end time, this lag is the time budget reserved for data that arrives late due to, e.g., network delays.
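The watermark formula is easy to verify with a small stand-alone sketch (class and method names illustrative): the watermark is derived from the maximum event time seen, so it is monotone and an out-of-order event never moves it backwards:

```java
public class WatermarkCalc {
    private final long maxOutOfOrderness;
    // Maximum event time seen so far; initialized to 0 rather than Long.MIN_VALUE
    // so that subtracting the out-of-orderness cannot underflow.
    private long maxSeenEventTime = 0L;

    WatermarkCalc(long maxOutOfOrderness) { this.maxOutOfOrderness = maxOutOfOrderness; }

    // Observe an event and return the watermark after it.
    long onEvent(long eventTime) {
        maxSeenEventTime = Math.max(maxSeenEventTime, eventTime);
        return currentWatermark();
    }

    long currentWatermark() { return maxSeenEventTime - maxOutOfOrderness; }

    public static void main(String[] args) {
        WatermarkCalc wm = new WatermarkCalc(2_000L);
        wm.onEvent(10_000L);
        wm.onEvent(9_000L); // an out-of-order event does not move the watermark back
        System.out.println(wm.currentWatermark()); // 8000
    }
}
```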
Watermarks

Common ways to compute watermarks in Flink:
- Periodic watermark computation (recommended)
class UserDefineAssignerWithPeriodicWatermarks extends AssignerWithPeriodicWatermarks[(String,Long)] {
  var maxAllowOrderness = 2000L // max out-of-orderness
  // the maximum event time seen so far; do not initialize with Long.MinValue,
  // since subtracting from it may underflow into a huge positive number
  var maxSeenEventTime = 0L
  var sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  // called periodically by the system to compute the current watermark
  override def getCurrentWatermark: Watermark = {
    new Watermark(maxSeenEventTime - maxAllowOrderness)
  }
  // updates the watermark basis and extracts the event time
  override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
    // always keep the maximum event time seen
    maxSeenEventTime = Math.max(maxSeenEventTime, element._2)
    println("ET:"+(element._1,sdf.format(element._2))+" WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
    element._2
  }
}
- Per-event watermark computation (not recommended)
class UserDefineAssignerWithPunctuatedWatermarks extends AssignerWithPunctuatedWatermarks[(String,Long)] {
  var maxAllowOrderness = 2000L
  var maxSeenEventTime = Long.MinValue
  var sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  // called once per received record to check for and emit the next watermark
  override def checkAndGetNextWatermark(lastElement: (String, Long),
                                        extractedTimestamp: Long): Watermark = {
    maxSeenEventTime = Math.max(maxSeenEventTime, lastElement._2)
    println("ET:"+(lastElement._1,sdf.format(lastElement._2))+" WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
    new Watermark(maxSeenEventTime - maxAllowOrderness)
  }
  // extracts the event timestamp
  override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
    element._2
  }
}
Test Case
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1) // parallelism 1 for easier testing
// the default time characteristic is ProcessingTime; switch to EventTime
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// optionally set the periodic watermark interval to 1s
// env.getConfig.setAutoWatermarkInterval(1000)
// input format: word timestamp
env.socketTextStream("CentOS", 9999)
  .map(line => line.split("\\s+"))
  .map(ts => (ts(0), ts(1).toLong))
  .assignTimestampsAndWatermarks(new UserDefineAssignerWithPeriodicWatermarks)
  .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
  .apply(new UserDefineAllWindowFucntion)
  .print("output")
env.execute("Tumbling Event Time Window Stream")
Late Data
In Flink, once the watermark passes a window's end time, any data that still falls into that already-passed window is considered late. Spark cannot process such data at all, but in Flink the user can define an allowed lateness t' for window elements.
- If the watermark time t < window end time t'' + allowed lateness t', the element can still participate in the window computation.
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1) // parallelism 1 for easier testing
// the default time characteristic is ProcessingTime; switch to EventTime
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// optionally set the periodic watermark interval to 1s
// env.getConfig.setAutoWatermarkInterval(1000)
// input format: word timestamp
env.socketTextStream("CentOS", 9999)
  .map(line=>line.split("\\s+"))
  .map(ts=>(ts(0),ts(1).toLong))
  .assignTimestampsAndWatermarks(new UserDefineAssignerWithPeriodicWatermarks)
  .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
  .allowedLateness(Time.seconds(2)) // allowed lateness
  .apply(new UserDefineAllWindowFucntion)
  .print("output")
env.execute("Tumbling Event Time Window Stream")
Our window is 2 seconds long; when those 2 seconds are up, the window computation fires and produces a first result. allowedLateness makes the window linger for the lateness period after it ends: a late element whose event time is greater than the window's end time but smaller than end time + lateness is still added to the window. Each newly arrived late element is added to the ProcessWindowFunction's buffer, the window's trigger FIREs once more, the window function is re-invoked, and the result is updated.
- If the watermark time t >= window end time t'' + allowed lateness t', Flink drops the element by default. Users can, however, capture this too-late data through a side output, so that they know which late elements failed to join the normal computation.
object FlinkEventTimeTumblingWindowTooLateData {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // parallelism 1 for easier testing
    // the default time characteristic is ProcessingTime; switch to EventTime
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // optionally set the periodic watermark interval to 1s
    // env.getConfig.setAutoWatermarkInterval(1000)
    val lateTag = new OutputTag[(String,Long)]("late")
    // input format: word timestamp
    var result = env.socketTextStream("CentOS", 9999)
      .map(line=>line.split("\\s+"))
      .map(ts=>(ts(0),ts(1).toLong))
      .assignTimestampsAndWatermarks(new UserDefineAssignerWithPunctuatedWatermarks)
      .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
      .allowedLateness(Time.seconds(2))
      .sideOutputLateData(lateTag)
      .apply(new UserDefineAllWindowFucntion)
    result.print("on time")
    result.getSideOutput(lateTag).printToErr("too late")
    env.execute("Tumbling Event Time Window Stream")
  }
}
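The two rules above combine into a three-way decision per element. The sketch below (names illustrative) is slightly simplified: Flink's event-time trigger actually compares the watermark against the window's maxTimestamp(), i.e. the end time minus one millisecond:

```java
public class LatenessRule {
    enum Verdict { ON_TIME, LATE_BUT_ALLOWED, TOO_LATE }

    // Classify an element belonging to a window ending at `windowEnd`,
    // given the current watermark and the allowedLateness setting (all in ms).
    static Verdict classify(long watermark, long windowEnd, long allowedLateness) {
        if (watermark < windowEnd) {
            return Verdict.ON_TIME;             // window has not fired yet
        }
        if (watermark < windowEnd + allowedLateness) {
            return Verdict.LATE_BUT_ALLOWED;    // window re-fires with the late element
        }
        return Verdict.TOO_LATE;                // dropped, or routed to the side output
    }

    public static void main(String[] args) {
        // window [0s, 2s), allowedLateness = 2s
        System.out.println(classify(1_500L, 2_000L, 2_000L)); // ON_TIME
        System.out.println(classify(3_000L, 2_000L, 2_000L)); // LATE_BUT_ALLOWED
        System.out.println(classify(4_500L, 2_000L, 2_000L)); // TOO_LATE
    }
}
```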
Notes on Max Out-of-Orderness vs. Allowed Lateness
- Max out-of-orderness: the amount by which the watermark trails the maximum event time seen. Because a window only fires once the watermark passes its end time, this lag is the time budget for data delayed by the network: elements that belong to the window and arrive within it are simply added to the window and computed together once the watermark passes the end time.
- Allowed lateness: applies to elements of a window that arrive after the watermark has already passed its end time. Within the allowed lateness, each such element triggers one more computation against the window's previous result.
- Spark Structured Streaming simply drops late data, but it does have a watermark delay, which plays a role similar to Flink's max out-of-orderness.
This is my own understanding after discussing with friends; if anyone spots a mistake, please point it out in the comments.