A window fires as soon as the watermark has reached or passed the window's end time and the window contains data. For example, a window covering [12:00:00, 12:01:00) fires for the first time once the watermark reaches 12:01:00.
In addition, if allowedLateness is configured, late data arriving while the watermark is still below the window end time plus the allowedLateness interval will trigger the window again.
Tumbling window with watermark:
package Flink_Window;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import javax.annotation.Nullable;
import java.text.SimpleDateFormat;

// Tumbling-window watermark example
public class SocketCountWindowEvent {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Connect to the socket source
        DataStreamSource<String> dataStreamSource = env.socketTextStream("192.168.208.121", 8888, "\n");
        DataStream<String> streamOperator = dataStreamSource.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<String>() {
            // The watermark only ever moves forward, never backward.
            // maxOutOfOrderness: maximum out-of-orderness tolerated for the data, here 30 s
            Long maxOutOfOrderness = 30000L; // 30 s
            Long currentMaxTimestamp = 0L;

            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }

            @Override
            public long extractTimestamp(String s, long l) {
                String[] split = s.split(",");
                long event_time = Long.parseLong(split[1]);
                // Track the maximum event time seen so far, so the watermark never decreases
                currentMaxTimestamp = Math.max(event_time, currentMaxTimestamp);
                return event_time;
            }
        });
        // Main WordCount logic
        DataStream<Tuple2<String, Integer>> windowCounts = streamOperator.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] split = s.split("\\W+");
                collector.collect(Tuple2.of(split[0], 1));
            }
        });
        DataStream<Tuple2<String, Integer>> result = windowCounts.keyBy(0)
                .timeWindow(Time.seconds(60))
                .process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

                    @Override
                    public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> v2s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                        System.out.println("+++++++++++++++++++");
                        System.out.println("subtask: " + Thread.currentThread().getId()
                                + " window range: " + sdf.format(context.window().getStart()) + "\t" + sdf.format(context.window().getEnd()));
                        // Show which watermark triggered this firing
                        System.out.println(context.currentWatermark() + "\t" + sdf.format(context.currentWatermark()));
                        int sum = 0;
                        for (Tuple2<String, Integer> v2 : v2s) {
                            sum += v2.f1;
                        }
                        collector.collect(Tuple2.of(tuple.getField(0), sum));
                        System.out.println("+++++++++++++++++++ end ++++++++++++++++++");
                    }
                });
        result.print();
        env.execute("SocketCountWindowEvent");
    }
}
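To try the job out you can, for example, run nc -lk 8888 on the socket host and type lines in the form word,eventTimeInMillis, which is what the s.split(",") in the assigner expects (the concrete values here are only for illustration). Sending a,1600000000000 sets currentMaxTimestamp to 1600000000000, so the watermark becomes 1600000000000 - 30000. A 60-second window therefore fires only after some event arrives whose timestamp is at least the window's end time plus 30 seconds, because only then does the watermark reach the window end.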
Here the watermark is computed as the largest event time seen so far minus 30 seconds (the maxOutOfOrderness).
By default, once a window has fired, any data belonging to it that arrives afterwards is simply dropped by Flink.
How should the maximum out-of-orderness (maxOutOfOrderness) be set?
It has to be chosen based on your business and your data. If maxOutOfOrderness is too small while the data actually arrives heavily out of order or late (for example because of the network), many windows will end up firing with only one or a few records in them, which badly hurts the correctness of the results. For severely out-of-order data you need to measure the real maximum delay carefully to keep the computation accurate.
Setting the delay too small hurts correctness; setting it too large not only hurts timeliness but also puts extra load on the Flink job. For data that does not strictly require event-time semantics, it is usually better not to use event time at all, because of the risk of losing data.
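If you do not want to maintain currentMaxTimestamp by hand, the Flink DataStream API also ships a helper, BoundedOutOfOrdernessTimestampExtractor, that implements exactly this "maximum event time minus a fixed delay" watermark. A minimal sketch, reusing the dataStreamSource and the word,eventTimeInMillis input format from the example above:

// org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
// tracks the maximum timestamp internally and emits watermark = maxTimestamp - 30 s
DataStream<String> withWatermarks = dataStreamSource.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(30)) {
            @Override
            public long extractTimestamp(String s) {
                // same input format as above: word,eventTimeInMillis
                return Long.parseLong(s.split(",")[1]);
            }
        });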
Watermark with a sliding window:
// Change the tumbling window into a sliding window
// .timeWindow(Time.seconds(60))
.timeWindow(Time.seconds(60), Time.seconds(30))
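With a 60-second window that slides every 30 seconds, each record belongs to two overlapping windows; for example, an event at 12:00:45 is counted in both [12:00:00, 12:01:00) and [12:00:30, 12:01:30), and each of those windows still fires once the watermark reaches that window's own end time (the timestamps here are only illustrative).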
Flink offers three ways of handling late data (late data: records whose eventTime is earlier than the current watermark):
1. Discard it (the default)
In Flink, if the window that an input record belongs to has already fired, the record is dropped by default instead of triggering the corresponding window again.
2. allowedLateness: specify how long late data is still accepted
In Flink, once the window that an input record belongs to has fired, new data for it will not trigger the window again by default. What if we do want it to be triggered again? In some situations we want to give late data a grace period. Flink provides the allowedLateness method for this: it attaches a lateness interval to the window, and data arriving within that interval triggers the window computation again (see the sketch after this list).
An example helps to illustrate the two delays:
maxOutOfOrderness is the maximum out-of-orderness tolerated for the data. Think of a cargo ship scheduled to leave at 10:00 that grants passengers a 5-minute grace period, so it actually leaves at 10:05.
allowedLateness controls whether the window can be triggered again after it has fired. Think of the ship that has already left at 10:05 but grants passengers another 2 minutes: anyone who shows up within those 2 minutes is still picked up and allowed to climb aboard. The difference between the two is that allowedLateness is an extra grace period added on top of the lateness already tolerated by the watermark.
In other words, once a window has been fired by the watermark, data that arrives for it afterwards does not, by default, trigger the window or its aggregation again. If you want such data to still trigger the window, use the late-data mechanisms.
So with allowedLateness configured, the window trigger conditions are:
a. The window fires for the first time when the watermark >= the window's window_end_time;
b. It fires again (one or more times) as long as the watermark < window_end_time + allowedLateness.
3. sideOutputLateData: collect the late data
The sideOutputLateData method collects all late data into a side output so it can be stored in one place and inspected later when troubleshooting.
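A minimal sketch of option 2 (the 2-minute value is only an illustration, not taken from the original code): allowedLateness is chained onto the windowed stream right before sideOutputLateData, following the same "replace this line" style as the sliding-window snippet above.

.timeWindow(Time.seconds(60))
// keep the window state for 2 more minutes and re-fire the window for data arriving in that interval
.allowedLateness(Time.minutes(2))
// anything arriving even later than that goes to the side output
.sideOutputLateData(outputTag)

Putting the numbers together: for a window [10:00:00, 10:01:00) with a maxOutOfOrderness of 30 s and an allowedLateness of 2 minutes, the window first fires once the watermark reaches 10:01:00 (i.e. after an event with a timestamp of 10:01:30 or later has arrived); late records that arrive while the watermark is still below 10:03:00 fire the window again; anything later than that is dropped, or routed to the side output if sideOutputLateData is configured.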
Handling late data:
package Flink_Window;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import javax.annotation.Nullable;
import java.text.SimpleDateFormat;

// Watermark example that collects late data via a side output
public class SocketCountWindowEvent {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Connect to the socket source
        DataStreamSource<String> dataStreamSource = env.socketTextStream("192.168.208.121", 8888, "\n");
        DataStream<String> streamOperator = dataStreamSource.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<String>() {
            // The watermark only ever moves forward, never backward.
            // maxOutOfOrderness: maximum out-of-orderness tolerated for the data, here 30 s
            Long maxOutOfOrderness = 30000L; // 30 s
            Long currentMaxTimestamp = 0L;

            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }

            @Override
            public long extractTimestamp(String s, long l) {
                String[] split = s.split(",");
                long event_time = Long.parseLong(split[1]);
                // Track the maximum event time seen so far, so the watermark never decreases
                currentMaxTimestamp = Math.max(event_time, currentMaxTimestamp);
                return event_time;
            }
        });
        // Main WordCount logic
        DataStream<Tuple2<String, Integer>> windowCounts = streamOperator.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] split = s.split("\\W+");
                collector.collect(Tuple2.of(split[0], 1));
            }
        });
        // Tag used to collect the data that would otherwise be dropped
        OutputTag<Tuple2<String, Integer>> outputTag = new OutputTag<Tuple2<String, Integer>>("late_date") {};
        // Note: getSideOutput is defined on SingleOutputStreamOperator, so the result cannot be declared as DataStream
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = windowCounts.keyBy(0)
                // 60-second tumbling window
                .timeWindow(Time.seconds(60))
                // Route late data to the side output instead of dropping it
                .sideOutputLateData(outputTag)
                .process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

                    @Override
                    public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> v2s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                        System.out.println("+++++++++++++++++++");
                        System.out.println("subtask: " + Thread.currentThread().getId()
                                + " window range: " + sdf.format(context.window().getStart()) + "\t" + sdf.format(context.window().getEnd()));
                        // Show which watermark triggered this firing
                        System.out.println(context.currentWatermark() + "\t" + sdf.format(context.currentWatermark()));
                        int sum = 0;
                        for (Tuple2<String, Integer> v2 : v2s) {
                            sum += v2.f1;
                        }
                        collector.collect(Tuple2.of(tuple.getField(0), sum));
                        System.out.println("+++++++++++++++++++ end ++++++++++++++++++");
                    }
                });
        // Print the late data to the console for now; in practice it can be written to another storage system
        DataStream<Tuple2<String, Integer>> sideOutput = result.getSideOutput(outputTag);
        sideOutput.print();
        result.print();
        env.execute("SocketCountWindowEvent");
    }
}
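To see the side output in action with this job (assuming the same word,eventTimeInMillis input format as before): first send a record whose event time is at least 30 seconds past the end of some window, so that the watermark passes that window's end and the window fires; then send a record whose event time falls inside that already-fired window. Since no allowedLateness is configured here, the late record is not added to the window result but shows up on the sideOutput print stream instead.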