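1. Preface
While using the SQL over window functions in Flink, I kept wondering how they are implemented underneath. I could not find an explanation in the material I checked, and I did not have the time to dig through the source code, so I ran a few simple tests that reproduce the behavior following my own reasoning. Time was short and the write-up is unpolished; it is meant only as a reference for beginners.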
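2. What is tested
To understand over windows better, the tests reproduce the possible underlying implementation of both the row-based and the event-time-based variants.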
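3. Implementation
1. From the start of the partition to the current row
OverWindow w = Over.partitionBy($("id")).orderBy($("ts")).preceding(UNBOUNDED_ROW).as("w");
My own take on the underlying implementation: open tumbling windows sized to the smallest granularity of ts, keep what the aggregate needs in keyed state (for a sum, the listing keeps only the running total in a ValueState rather than a list of all values), iterate over the values of each window, and emit one result per row as soon as that row has been added to the aggregate.
Code: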
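import java.time.Duration;
import java.util.ArrayList;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// WaterSensor (getId, getTs, getVc) and BeansUtils.iterToList are the author's own classes;
// iterToList presumably copies an Iterable into an ArrayList. Import them from your project.
// The later listings need the same imports, with ListState/ListStateDescriptor where they use list state.

public class W20_Window_Over_Row_Test {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInteger("rest.port", 10000);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setParallelism(1);

        // Each input line has three space-separated fields: id, ts, vc.
        DataStreamSource<String> dataStreamSource = env.socketTextStream("hadoop162", 8888);
        SingleOutputStreamOperator<WaterSensor> dataStream = dataStreamSource.map(value -> {
            String[] split = value.split(" ");
            return new WaterSensor(split[0], Long.valueOf(split[1]), Integer.valueOf(split[2]));
        });

        SingleOutputStreamOperator<WaterSensor> stream = dataStream
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                    .withTimestampAssigner((element, recordTimestamp) -> element.getTs())
            );

        stream
            .keyBy(WaterSensor::getId)
            .window(TumblingEventTimeWindows.of(Time.seconds(1L)))
            .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                // Keyed state: the running sum survives across windows of the same key.
                private ValueState<Integer> sum;

                @Override
                public void open(Configuration parameters) throws Exception {
                    sum = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("sum", Integer.class));
                }

                @Override
                public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                    ArrayList<WaterSensor> waterSensors = BeansUtils.iterToList(elements);
                    if (sum.value() == null) {
                        sum.update(0);
                    }
                    // Row semantics: update the sum and emit immediately, once per row.
                    for (WaterSensor waterSensor : waterSensors) {
                        sum.update(sum.value() + waterSensor.getVc());
                        out.collect(waterSensor.toString() + " | " + sum.value() + "|" + context.window());
                    }
                }
            }).print();

        try {
            env.execute("test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}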
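The per-row behavior of the listing above can be checked without a Flink cluster. The following is a minimal plain-Java sketch, not Flink's actual implementation: `Reading` is a hypothetical stand-in for `WaterSensor`, and a `HashMap` plays the role of the keyed `ValueState`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UnboundedRowsSketch {
    /** Hypothetical stand-in for the article's WaterSensor (id, ts, vc). */
    static final class Reading {
        final String id;
        final long ts;
        final int vc;

        Reading(String id, long ts, int vc) {
            this.id = id;
            this.ts = ts;
            this.vc = vc;
        }
    }

    /** Emits one running sum per input row; sums are kept separately per id. */
    static List<Integer> runningSums(List<Reading> rows) {
        Map<String, Integer> sumPerKey = new HashMap<>(); // plays the role of keyed state
        List<Integer> out = new ArrayList<>();
        for (Reading r : rows) {
            int sum = sumPerKey.merge(r.id, r.vc, Integer::sum); // update, then emit
            out.add(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Reading> rows = Arrays.asList(
            new Reading("s1", 1000L, 10),
            new Reading("s1", 2000L, 20),
            new Reading("s2", 2000L, 5),
            new Reading("s1", 3000L, 30));
        System.out.println(runningSums(rows)); // [10, 30, 5, 60]
    }
}
```

Each row sees the sum of all earlier rows with the same id plus itself, which is what `preceding(UNBOUNDED_ROW)` is expected to produce.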
2. From the start of the partition to the current time
OverWindow w = Over.partitionBy($("id")).orderBy($("ts")).preceding(UNBOUNDED_RANGE).as("w");
This is the default when preceding is omitted.
My own take on the underlying implementation: open tumbling windows sized to the smallest granularity of ts, keep the aggregate in keyed state (the listing keeps a running sum in a ValueState), add every value of the window to the aggregate first, and only then iterate over the window again and emit one result per row. Rows that fall into the same window therefore all show the same aggregate.
public class W20_Window_Over_Row_Test {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInteger("rest.port", 10000);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setParallelism(1);

        DataStreamSource<String> dataStreamSource = env.socketTextStream("hadoop162", 8888);
        SingleOutputStreamOperator<WaterSensor> dataStream = dataStreamSource.map(value -> {
            String[] split = value.split(" ");
            return new WaterSensor(split[0], Long.valueOf(split[1]), Integer.valueOf(split[2]));
        });

        SingleOutputStreamOperator<WaterSensor> stream = dataStream
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                    .withTimestampAssigner((element, recordTimestamp) -> element.getTs())
            );

        stream
            .keyBy(WaterSensor::getId)
            .window(TumblingEventTimeWindows.of(Time.seconds(1L)))
            .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                private ValueState<Integer> sum;

                @Override
                public void open(Configuration parameters) throws Exception {
                    sum = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("sum", Integer.class));
                }

                @Override
                public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                    ArrayList<WaterSensor> waterSensors = BeansUtils.iterToList(elements);
                    if (sum.value() == null) {
                        sum.update(0);
                    }
                    // First pass: add the whole window to the running sum, without emitting.
                    for (WaterSensor waterSensor : waterSensors) {
                        sum.update(sum.value() + waterSensor.getVc());
                    }
                    // Second pass: every row of the window is emitted with the same, final sum.
                    for (WaterSensor waterSensor : waterSensors) {
                        out.collect(waterSensor.toString() + " | " + sum.value() + "|" + context.window());
                    }
                }
            }).print();

        try {
            env.execute("test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
3. From a number of rows back to the current row
OverWindow w = Over.partitionBy($("id")).orderBy($("ts")).preceding(rowInterval(1L)).as("w");
My own take on the underlying implementation: open tumbling windows sized to the smallest granularity of ts and keep the most recent values in a List state. For each value of a window, append it to the list; when the list holds more than rowInterval + 1 values, remove the first one; then compute the aggregate over the list and emit it, once per row. The listing uses 2 preceding rows instead of 1.
public class W21_Window_Over_Row_Test1 {
    public static void main(String[] args) {
        int preceding = 2; // number of rows before the current row
        Configuration conf = new Configuration();
        conf.setInteger("rest.port", 10000);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setParallelism(1);

        DataStreamSource<String> dataStreamSource = env.socketTextStream("hadoop162", 8888);
        SingleOutputStreamOperator<WaterSensor> dataStream = dataStreamSource.map(value -> {
            String[] split = value.split(" ");
            return new WaterSensor(split[0], Long.valueOf(split[1]), Integer.valueOf(split[2]));
        });

        SingleOutputStreamOperator<WaterSensor> stream = dataStream
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                    .withTimestampAssigner((element, recordTimestamp) -> element.getTs())
            );

        stream
            .keyBy(WaterSensor::getId)
            .window(TumblingEventTimeWindows.of(Time.seconds(1L)))
            .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                // Holds at most preceding + 1 rows: the current row and its predecessors.
                private ListState<WaterSensor> sum;

                @Override
                public void open(Configuration parameters) throws Exception {
                    sum = getRuntimeContext().getListState(new ListStateDescriptor<WaterSensor>("sum", WaterSensor.class));
                }

                @Override
                public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                    ArrayList<WaterSensor> waterSensors = BeansUtils.iterToList(elements);
                    ArrayList<WaterSensor> sumList = BeansUtils.iterToList(sum.get());
                    for (WaterSensor waterSensor : waterSensors) {
                        if (waterSensor.getTs() > 0) {
                            sumList.add(waterSensor);
                            // One row was added, so at most one row has to be dropped.
                            if (sumList.size() >= preceding + 2) {
                                sumList.remove(0);
                            }
                            // Local total, named differently from the state field to avoid shadowing it.
                            int total = 0;
                            for (WaterSensor sensor : sumList) {
                                total += sensor.getVc();
                            }
                            out.collect(waterSensor.toString() + " | " + total + " | " + context.window());
                        }
                    }
                    sum.update(sumList);
                }
            }).print();

        try {
            env.execute("test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
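The add-then-drop logic of the list-state listing above can be isolated in a few lines. This is a minimal plain-Java sketch for a single key, under the assumption that rows arrive already ordered; the class and method names are made up for illustration, and an `ArrayDeque` replaces the list state so that dropping the oldest row is cheap.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BoundedRowsSketch {
    /** For each value, returns the sum of that value and up to `preceding` values before it. */
    static List<Integer> slidingRowSums(int[] values, int preceding) {
        Deque<Integer> window = new ArrayDeque<>(); // plays the role of the list state
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        for (int v : values) {
            window.addLast(v);
            sum += v;
            // The window may hold the current row plus `preceding` earlier rows.
            if (window.size() > preceding + 1) {
                sum -= window.removeFirst();
            }
            out.add(sum); // one result per row
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(slidingRowSums(new int[] {1, 2, 3, 4, 5}, 2)); // [1, 3, 6, 9, 12]
    }
}
```

Keeping a running `sum` and subtracting the dropped value avoids re-adding the whole list for every row; the result is the same as recomputing the sum each time.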
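4. From a time interval back to the current time
OverWindow w = Over.partitionBy($("id")).orderBy($("ts")).preceding(lit(2).second()).as("w");
My own take on the underlying implementation: open tumbling windows sized to the smallest granularity of ts and keep the values in a List state. For each value of a window, append it to the list and then remove every value whose timestamp is earlier than the current event time minus the interval. After that, compute the aggregate and emit the results in a second pass over the window.
Note: the values in state are appended in event-time order, so it is enough to test the first value each time; if it has expired, remove it and test the new first value, until the first value is still inside the interval.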
public class W22_Window_Over_Row_Test2 {
    public static void main(String[] args) {
        final long preceding = 2L; // interval length in seconds
        Configuration conf = new Configuration();
        conf.setInteger("rest.port", 10000);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setParallelism(1);

        DataStreamSource<String> dataStreamSource = env.socketTextStream("hadoop162", 8888);
        SingleOutputStreamOperator<WaterSensor> dataStream = dataStreamSource.map(value -> {
            String[] split = value.split(" ");
            return new WaterSensor(split[0], Long.valueOf(split[1]), Integer.valueOf(split[2]));
        });

        SingleOutputStreamOperator<WaterSensor> stream = dataStream
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                    .withTimestampAssigner((element, recordTimestamp) -> element.getTs())
            );

        stream
            .keyBy(WaterSensor::getId)
            .window(TumblingEventTimeWindows.of(Time.seconds(1L)))
            .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                // Rows whose timestamp is still inside the interval, oldest first.
                private ListState<WaterSensor> sum;

                @Override
                public void open(Configuration parameters) throws Exception {
                    sum = getRuntimeContext().getListState(new ListStateDescriptor<WaterSensor>("sum", WaterSensor.class));
                }

                @Override
                public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                    ArrayList<WaterSensor> waterSensors = BeansUtils.iterToList(elements);
                    ArrayList<WaterSensor> sumList = BeansUtils.iterToList(sum.get());
                    for (WaterSensor waterSensor : waterSensors) {
                        if (waterSensor.getTs() > 0) {
                            sumList.add(waterSensor);
                            // Several leading rows can expire at once, so keep removing the head
                            // until it is inside the interval. (An index-based for loop that calls
                            // remove(0) stops too early because the list shrinks while i grows.)
                            while (!sumList.isEmpty()
                                    && sumList.get(0).getTs() < waterSensor.getTs() - preceding * 1000) {
                                sumList.remove(0);
                            }
                        }
                    }
                    // Second pass: every row of the window is emitted with the same total.
                    for (WaterSensor waterSensor : waterSensors) {
                        if (waterSensor.getTs() > 0) {
                            int total = 0;
                            for (WaterSensor sensor : sumList) {
                                total += sensor.getVc();
                            }
                            out.collect(waterSensor.toString() + " | " + total + " | " + context.window());
                        }
                    }
                    sum.update(sumList);
                }
            }).print();

        try {
            env.execute("test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
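The head-eviction step is the part of this listing that is easiest to get wrong, so here is a minimal plain-Java sketch of it for a single key. It is an illustration under stated assumptions, not Flink's implementation: timestamps are in milliseconds and already sorted, rows with equal timestamps are treated as one group (standing in for one tumbling window), and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BoundedRangeSketch {
    /**
     * For each row, returns the sum of all rows whose timestamp lies in
     * [ts - precedingMs, ts]. Rows with equal timestamps get the same result.
     */
    static List<Integer> rangeSums(long[] ts, int[] vc, long precedingMs) {
        Deque<Integer> window = new ArrayDeque<>(); // indexes of rows still in range, oldest first
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        int i = 0;
        while (i < ts.length) {
            // Add every row that shares the current timestamp.
            int groupEnd = i;
            while (groupEnd < ts.length && ts[groupEnd] == ts[i]) {
                window.addLast(groupEnd);
                sum += vc[groupEnd];
                groupEnd++;
            }
            // Evict from the head until the oldest row is inside the interval.
            while (!window.isEmpty() && ts[window.peekFirst()] < ts[i] - precedingMs) {
                sum -= vc[window.removeFirst()];
            }
            // Emit once per row of the group, all with the same sum.
            for (int k = i; k < groupEnd; k++) {
                out.add(sum);
            }
            i = groupEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {1000L, 2000L, 3000L, 3000L, 6000L};
        int[] vc = {1, 2, 3, 4, 5};
        System.out.println(rangeSums(ts, vc, 2000L)); // [1, 3, 10, 10, 5]
    }
}
```

The last row shows why eviction must loop: at ts 6000 all four earlier rows fall out of the 2-second interval at once, and a single `if` would remove only one of them.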
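4. Summary
1. The tests show that values with ts <= 0 are not added to state.
2. The tumbling window guarantees that each value is stored once and is never counted twice; the aggregation itself is computed from state.
3. (1) For aggregation bounded by row count (row), the size of the underlying tumbling window does not affect the result: one result is still emitted per record. The size only determines how often state is updated and how quickly results appear; the larger the window, the faster the processing.
(2) For aggregation bounded by time (range), the size of the underlying tumbling window has to match the smallest granularity of the event time, because values that share the same time granularity must be computed in the same window.
4. As far as I currently know, that granularity is probably 1s, since I believe Flink currently only supports event-time conversion at second precision; the relevant configuration has to be checked before drawing a conclusion.
5. Based on the tests so far, the row-based and time-based over windows differ underneath in whether a result is emitted after each single computation, or only after the computation is complete, by iterating over the rows again.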
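The difference named in point 5 shows up only when several rows share a timestamp. The following plain-Java sketch contrasts the two emission orders on the same input for a single key; it assumes sorted timestamps, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class RowsVsRangeSketch {
    /** Row style: compute once, emit once. Each row sees only the rows before it and itself. */
    static List<Integer> unboundedRows(int[] vc) {
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        for (int v : vc) {
            sum += v;
            out.add(sum);
        }
        return out;
    }

    /** Range style: add all rows with the same timestamp first, then emit them together. */
    static List<Integer> unboundedRange(long[] ts, int[] vc) {
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        int i = 0;
        while (i < ts.length) {
            int end = i;
            while (end < ts.length && ts[end] == ts[i]) {
                sum += vc[end];
                end++;
            }
            for (int k = i; k < end; k++) {
                out.add(sum); // rows with equal timestamps share one result
            }
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {1000L, 1000L, 2000L};
        int[] vc = {1, 2, 3};
        System.out.println(unboundedRows(vc));      // [1, 3, 6]
        System.out.println(unboundedRange(ts, vc)); // [3, 3, 6]
    }
}
```

With distinct timestamps both methods return the same list, which is consistent with point 3: the window size matters for range but not for row.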