This article walks through Flink's common operators with Java code, using the Kafka source set up in the previous article.

I. Map: processes the stream element by element; commonly used to clean and transform the records in a dataset.

input.print();
SingleOutputStreamOperator<Tuple2<String, Integer>> map = input.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<>(s, 1);
    }
});
map.print();

Each incoming record is paired with the constant 1 and returned as a two-field tuple.
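As a side note, the same mapping can be written as a lambda. This is a minimal sketch: because of Java type erasure, the lambda form needs an explicit returns(...) type hint for the Tuple2 output, which the anonymous-class version above does not.

// Lambda sketch of the same map; Types is org.apache.flink.api.common.typeinfo.Types.
SingleOutputStreamOperator<Tuple2<String, Integer>> mapped = input
        .map(s -> new Tuple2<>(s, 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT)); // type hint needed due to erasure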

Kafka input: (screenshot)

Output: (screenshot)

II. FlatMap: takes one element as input and produces multiple elements, "exploding" the input.

input.print();
SingleOutputStreamOperator<String> stringSingleOutputStreamOperator = input.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String s, Collector<String> collector) throws Exception {
        String[] strs = s.split(",");
        for (String str : strs) {
            collector.collect(str);
        }
    }
});
stringSingleOutputStreamOperator.print();
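A lambda version is possible here too (a sketch); since Collector<String> is generic, the lambda again needs a returns(...) hint:

// Lambda sketch of the same flatMap; Types is org.apache.flink.api.common.typeinfo.Types.
SingleOutputStreamOperator<String> words = input
        .flatMap((String s, Collector<String> out) -> {
            for (String str : s.split(",")) {
                out.collect(str);
            }
        })
        .returns(Types.STRING); // Collector's generic type is erased, so hint the output type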

Kafka input: (screenshot)

Output: (screenshot)

III. Filter: filters the incoming elements, keeping only those that pass a predicate.

input.print();
SingleOutputStreamOperator<String> hello = input.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String s) throws Exception {
        return !s.equals("hello");
    }
});
hello.print();
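Since filter does not change the element type, the lambda form needs no type hint; a one-line sketch:

// Drops every "hello" record, same as the anonymous class above.
SingleOutputStreamOperator<String> filtered = input.filter(s -> !"hello".equals(s));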

Kafka input: (screenshot)

Output: (screenshot)

IV. KeyBy: takes a DataStream and produces a KeyedStream; records with the same key are placed in the same partition.

input.print();
SingleOutputStreamOperator<String> stringSingleOutputStreamOperator = input.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String s, Collector<String> collector) throws Exception {
        String[] strs = s.split(",");
        for (String str : strs) {
            collector.collect(str);
        }
    }
});
SingleOutputStreamOperator<Tuple2<String, Integer>> map = stringSingleOutputStreamOperator.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<>(s, 1);
    }
});

KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream = map.keyBy(0);
tuple2TupleKeyedStream.print();

Kafka input: (screenshot)

Output: (screenshot)

keyBy(0) means keying on the first field (index 0). So all elements whose field 0 is a land in the same group and are printed by subtask 3, while all elements whose field 0 is b are printed by subtask 1.
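The same keying can also be expressed with a KeySelector, which gives the KeyedStream a concrete key type instead of Tuple; a sketch:

// Key by the first tuple field via a KeySelector (org.apache.flink.api.java.functions.KeySelector);
// equivalent to keyBy(0).
KeyedStream<Tuple2<String, Integer>, String> keyed = map.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
    @Override
    public String getKey(Tuple2<String, Integer> value) throws Exception {
        return value.f0;
    }
});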

V. Reduce: takes a KeyedStream and produces a DataStream, aggregating the records with a user-defined ReduceFunction, which must be associative and commutative. Continuing from the KeyBy result above, the code is:

input.print();
SingleOutputStreamOperator<String> stringSingleOutputStreamOperator = input.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String s, Collector<String> collector) throws Exception {
        String[] strs = s.split(",");
        for (String str : strs) {
            collector.collect(str);
        }
    }
});
SingleOutputStreamOperator<Tuple2<String, Integer>> map = stringSingleOutputStreamOperator.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<>(s, 1);
    }
});

KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream = map.keyBy(0);
// tuple2TupleKeyedStream.print();
SingleOutputStreamOperator<Tuple2<String, Integer>> reduce = tuple2TupleKeyedStream.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> stringIntegerTuple2, Tuple2<String, Integer> t1) throws Exception {
        String key1 = stringIntegerTuple2.f0;
        int value1 = stringIntegerTuple2.f1;
        int value2 = t1.f1;
        return new Tuple2<String, Integer>(key1, value1 + value2);
    }
});
reduce.print();

Kafka input: (screenshot)

Output: (screenshot)

As you can see, for the elements with keys a and b, the values have been accumulated.
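The same ReduceFunction as a lambda sketch (input and output types match, so no type hint is required):

// Keep the key from the left element, add the counts.
SingleOutputStreamOperator<Tuple2<String, Integer>> counts =
        tuple2TupleKeyedStream.reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));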

VI. Aggregations: takes a KeyedStream and produces a DataStream. These are really just prepackaged aggregation functions: sum, min, minBy, max, maxBy, and so on. I convert the input elements into three-field tuples so that individual elements are easy to pin down.

1. sum:

input.print();
SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> tuple3SingleOutputStreamOperator = input.flatMap(new FlatMapFunction<String, Tuple3<String, Integer, Integer>>() {
    @Override
    public void flatMap(String s, Collector<Tuple3<String, Integer, Integer>> collector) throws Exception {
        String[] strs = s.split(",");
        for (String str : strs) {
            String[] kv = str.split(":");
            collector.collect(new Tuple3<String, Integer, Integer>(kv[0], Integer.valueOf(kv[1]), (int) (Math.random() * 10)));
        }
    }
});
// tuple3SingleOutputStreamOperator.print();
KeyedStream<Tuple3<String, Integer, Integer>, Tuple> tuple3TupleKeyedStream = tuple3SingleOutputStreamOperator.keyBy(0);

tuple3TupleKeyedStream.sum(1).print("sum");
// tuple3TupleKeyedStream.min(1).print("min");
// tuple3TupleKeyedStream.max(1).print("max");
// tuple3TupleKeyedStream.minBy(1).print("minBy");
// tuple3TupleKeyedStream.maxBy(1).print("maxBy");
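Tuple fields can also be addressed by name instead of position (a sketch; "f1" is the second field of the tuple):

// Same aggregation as sum(1), using the field-expression form.
tuple3TupleKeyedStream.sum("f1").print("sum-by-name");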

Kafka input: (screenshot)

Since we can now pinpoint individual elements, let's step through the code and see exactly what gets emitted. (I think screenshots express this more clearly.)

Step into the sum() method in KeyedStream.java: (screenshot)

Step into the SumAggregator class and keep going with a breakpoint: (screenshot)

Clearly this if condition is true. Set a breakpoint in the class's reduce() method and continue: (screenshot)

value1 is (a,1,5) and value2 is (a,2,6). value1 is copied, and the copy is passed to fieldAccessor's set method. fieldAccessor is instantiated as follows: (screenshot)

The get method looks like this: (screenshot)

So get(value1) returns 1 and get(value2) returns 2, and the adder adds the two values together: (screenshot)

As mentioned above, value1 is copied and the copy is passed to fieldAccessor's set method. So field 1 of the copy is updated to the summed value, and the copy is returned: (screenshot)
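Putting the trace together, the net effect of SumAggregator.reduce on our Tuple3 elements can be sketched like this (a paraphrase of the behavior stepped through above, not the verbatim Flink source):

Tuple3<String, Integer, Integer> value1 = Tuple3.of("a", 1, 5);
Tuple3<String, Integer, Integer> value2 = Tuple3.of("a", 2, 6);

Tuple3<String, Integer, Integer> result = value1.copy(); // copy value1 first
result.setField(value1.f1 + value2.f1, 1);               // set field 1 of the copy to the sum
// result is now (a,3,5): field 1 holds the sum, the other fields still come from value1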

So the output is: (screenshot)

2. min and minBy: The official docs say: "Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy)." So given the two elements (a,1) and (a,2), by the official description min should return the minimum value 1, and minBy should return the element holding the minimum value, (a,1,random). But my experiment doesn't come out that way. Let's look at the source again; this time I won't step through from the start, and will only look at the key part, the reduce() method of the ComparableAggregator class.

First, min:

Kafka input: (screenshot)

(screenshot)

value1 is (a,1,6) and value2 is (a,2,3): (screenshot)

min is not a byAggregate, so we take the else branch: (screenshot)

c=1 means value1 is smaller (you have to look at how the comparator object is instantiated to see this). So value1 is returned directly: a whole element is returned, not just the smaller value. This differs from the official docs, and I'm not sure whether I did something wrong or whether it's my version (I'm on 1.7.0). Corrections from anyone who knows are welcome.

Output: (screenshot)

Now minBy:

Kafka input: (screenshot)

(screenshot)

This time the if (byAggregate) check is true and c=1, so value1 is returned.
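Putting the min and minBy traces together, the reduce logic we stepped through can be sketched roughly as follows (paraphrased from the screenshots, not the verbatim Flink source; tie-breaking details omitted):

// c == 1 means value1 holds the extremal (smaller/larger) field.
if (byAggregate) {                 // minBy / maxBy: return the winning element unchanged
    return c == 1 ? value1 : value2;
} else {                           // min / max: always return value1...
    if (c == 0) {                  // ...copying the winning field over when value2 wins
        value1 = fieldAccessor.set(value1, fieldAccessor.get(value2));
    }
    return value1;
}

Read this way, min would only guarantee that the aggregated field holds the running minimum, with the non-aggregated fields always coming from the running value1, whereas minBy returns the complete extremal element. That would explain why min appears to return a whole element rather than just a value.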

Output: (screenshot)

VII. This is entirely original work; please respect my effort. I'll stop here for now and continue the introduction when I have time. If there are mistakes in the article, corrections are welcome.