I. Flink Basics

1. What is Flink? Data model, architecture, and ecosystem

The official definition:

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Flink handles two kinds of data sets:

(1) Unbounded data streams: unbounded data sets -> streaming / real-time computation -> Flink DataStream API

(2) Bounded data streams: bounded data sets -> offline computation / batch processing -> Flink DataSet API

Flink's architecture:

(Figure: Flink architecture diagram)


Flink's architecture is very similar to that of Spark and Storm, which we discussed earlier: it follows a master/slave design. The master node is the JobManager; the worker nodes are TaskManagers. Each worker node runs a process, and the tasks inside that process are the smallest units of execution.

The Flink ecosystem:

(Figure: the Flink ecosystem)


As the figure above shows, Flink covers both offline (batch) and real-time (streaming) computation.

2. Deploying Flink

(1) Standalone mode

tar -zxvf flink-1.11.0-bin-scala_2.12.tgz -C ~/training/
Core configuration file: conf/flink-conf.yaml
Web UI port: 8081

  • Pseudo-distributed setup
    Flink's default configuration already works; just start the cluster:
    bin/start-cluster.sh

  • Fully distributed setup

(2) Flink on YARN

Flink jobs are submitted to YARN for execution. There are two modes; we normally use mode two.
Note: to use Hadoop you must put flink-shaded-hadoop-2-uber-2.8.3-10.0.jar into the lib directory, because since Flink 1.10 the Hadoop dependencies are no longer bundled with Flink.
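
A minimal sketch of doing that on the command line (the download location and the FLINK_HOME path are assumptions, adjust them to your installation):

cp flink-shaded-hadoop-2-uber-2.8.3-10.0.jar $FLINK_HOME/lib/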

(Mode 1) Centralized resource management (yarn-session)
YARN starts a Flink cluster once and reserves the requested resources; every job we submit runs inside this flink yarn-session, so no matter how many jobs there are, they all share the resources requested from YARN.

bin/yarn-session.sh -n 2  -jm 1024 -tm 1024 -d

(Mode 2) Per-job resource management
Every job submission creates a new Flink cluster on YARN. Jobs are independent of one another, do not interfere with each other, and are easy to manage; when a job finishes, its cluster disappears as well.

bin/flink run -m yarn-cluster -p 1 -yjm 1024 -ytm 1024 examples/batch/WordCount.jar
Note: -p 1 sets the job's parallelism


(3) HA mode: based on ZooKeeper

3. Running Flink jobs: WordCount

The examples directory that ships with Flink provides some sample jobs worth a look.
(1) Offline computation: batch processing

bin/flink run examples/batch/WordCount.jar -input hdfs://bigdata111:9000/input/data.txt -output hdfs://bigdata111:9000/flink/wc

(2) Streaming: real-time computation

First start a socket source on bigdata111 (for example: nc -l 1234), then run:

bin/flink run examples/streaming/SocketWindowWordCount.jar --port 1234

4. Writing your own Flink program: WordCount (Java)

Add the dependencies:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-streaming-java_2.11</artifactId>
	<version>1.11.0</version>
	<!--<scope>provided</scope>-->
</dependency>

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-clients_2.11</artifactId>
	<version>1.11.0</version>
</dependency>

(1) Offline computation: batch processing

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;


/**
 * @author
 * @date 2020/11/21
 */
public class WordCountBatchExample {
    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
        //create a DataSet representing the data to process
        DataSource<String> data = env.fromElements("i love beijing",
                "i love china", "china is the capital of beijing");

        data.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new Tuple2<String, Integer>(word,1));
                }
            }
        }).groupBy(0).sum(1).print();

    }
}

(2) Streaming: real-time computation

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * @author
 * @date 2020/11/21
 */
public class WordCountStreamExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment sen = StreamExecutionEnvironment.getExecutionEnvironment();

        //receive the input
        DataStreamSource<String> source = sen.socketTextStream("bigdata111", 1234);

        source.flatMap(new FlatMapFunction<String, WordCount>() {
            public void flatMap(String s, Collector<WordCount> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new WordCount(word,1));
                }
            }
        }).keyBy("word").sum("count").print().setParallelism(1);
        

        sen.execute("WordCountStreamExample");

    }
}
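
The streaming example above relies on a WordCount POJO with public word and count fields (the same class is reused by the Redis sink example later). It is not shown in the original notes; a minimal sketch of what it is assumed to look like:

//Assumed POJO for the streaming WordCount example; field names must match keyBy("word") and sum("count")
public class WordCount {
    public String word;
    public int count;

    //Flink POJOs need a public no-argument constructor
    public WordCount() {}

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return word + " : " + count;
    }
}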

On the bigdata111 Linux host, start: nc -l 1234

5. Comparison: technical characteristics of Storm, Spark Streaming, and Flink

(Figure: comparison of the technical characteristics of Storm, Spark Streaming, and Flink)

II. Flink DataSet API ----> offline computation / batch processing

Reading from and writing to MySQL

Add the dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-jdbc_2.11</artifactId>
    <version>1.9.1</version>
</dependency>

Java implementation

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //read from MySQL
        DataSource<Row> dataSource = env.createInput(JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("com.mysql.jdbc.Driver")
                .setDBUrl("jdbc:mysql://localhost:3306")
                .setUsername("root")
                .setPassword("root")
                .setQuery("select id,dq from flink.ajxx_xs ")
                .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO))
                .finish());
 
        final BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);
        tableEnv.registerDataSet("ods_tb01", dataSource);
 
        Table query = tableEnv.sqlQuery("select * from ods_tb01");
 
        DataSet<Row> result = tableEnv.toDataSet(query, Row.class);
 
        result.print();
 
        result.output(JDBCOutputFormat.buildJDBCOutputFormat()
                .setDrivername("com.mysql.jdbc.Driver")
                .setDBUrl("jdbc:mysql://localhost:3306")
                .setUsername("root")
                .setPassword("root")
                .setQuery("insert into flink.ajxx_xs2 (id,dq) values (?,?)")
                .setSqlTypes(new int[]{Types.INTEGER, Types.NCHAR})
                .finish());
 
        env.execute("flink-test");

Operator overview

map: takes one element and returns one element; cleaning/transformation work happens in between
FlatMap: takes one element and returns zero, one, or many elements
map vs. flatMap: map emits exactly one output per input, while flatMap can emit any number of outputs per input
MapPartition: like map, but processes a whole partition at a time
Filter: filter function; only the elements that satisfy the predicate are kept
Reduce: aggregation; combines the current element with the value returned by the previous reduce call and returns a new value
Aggregate: built-in aggregations such as sum, max, and min
distinct: returns the de-duplicated elements
join: inner join
outerJoin: outer join
cross: the Cartesian product of two data sets
Union: concatenation of two data sets; the element types must match
first-n: returns the first N elements of the data set
Sort Partition: sorts every partition of the data set locally; chain sortPartition() calls to sort on multiple fields
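
Reduce and Aggregate are not demonstrated in the examples below, so here is a minimal, self-contained sketch of Reduce (summing a DataSet of integers); the class and variable names are mine, not from the original notes:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReduceDemoSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> nums = env.fromElements(1, 2, 3, 4, 5);

        //combine the current element with the previously reduced value
        nums.reduce(new ReduceFunction<Integer>() {
            public Integer reduce(Integer a, Integer b) throws Exception {
                return a + b;
            }
        }).print();   //prints 15
    }
}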

(1) Map, FlatMap, and MapPartition; note the differences

public class FlinkDemo1 {
    public static void main(String[] args) throws Exception {

        // prepare the data
        ArrayList<String> data = new ArrayList<String>();
        data.add("I love Beijing");
        data.add("I love China");
        data.add("Beijing is the capital of China");
        
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> dataSource = env.fromCollection(data);

        /*
        * map: takes one element and returns one element; cleaning/transformation happens in between
        * */
        MapOperator<String, List<String>> map = dataSource.map(new MapFunction<String, List<String>>() {
            public List<String> map(String value) throws Exception {
                String[] words = value.split(" ");

                //create a List
                List<String> list = new ArrayList<String>();
                for (String w : words) {
                    list.add(w);
                }
                list.add("-----------");
                return list;
            }
        });
        map.print();
        System.out.println("**************************");

        /*
        * flatMap: takes one element and can return zero, one, or many elements
        * */
        dataSource.flatMap(new FlatMapFunction<String, String>() {
            public void flatMap(String value, Collector<String> out) throws Exception {
                String[] words = value.split(" ");
                for(String w:words) {
                    out.collect(w);
                }
            }
        }).print();
        System.out.println("************************");

        /*
        * mapPartition:
        * receives all the elements of one partition at a time; often used when each partition needs, e.g., its own database connection
        * */
        dataSource.mapPartition(new MapPartitionFunction<String, String>() {
            public void mapPartition(Iterable<String> iterable, Collector<String> out) throws Exception {
                Iterator<String> ite = iterable.iterator();
                while (ite.hasNext()){
                    String next = ite.next();
                    String[] s = next.split(" ");
                    for (String s1 : s) {
                        out.collect(s1);
                    }

                }
            }
        }).print();

    }
}

(2) Filter and Distinct

public class FlinkDemo2 {
    public static void main(String[] args) throws Exception {
        // prepare the data
        ArrayList<String> data = new ArrayList<String>();
        data.add("I love Beijing");
        data.add("I love China");
        data.add("Beijing is the capital of China");
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSource<String> dataSource = env.fromCollection(data);

        final FlatMapOperator<String, String> word = dataSource.flatMap(new FlatMapFunction<String, String>() {
            public void flatMap(String s, Collector<String> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(word);
                }
            }
        });
        // keep only words whose length is at least 3
        FilterOperator<String> filterword = word.filter(new FilterFunction<String>() {
            public boolean filter(String s) throws Exception {
                return s.length() >= 3;
            }
        });

        filterword.distinct().print();


    }
}

(3) Join: the inner join

Inner join: joins two tables and returns only the rows that satisfy the match condition (the primary key / foreign key match).

public class FlinkDemo3 {
    public static void main(String[] args) throws Exception {
        //first table: users (user ID, name)
        ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
        data1.add(new Tuple2<Integer, String>(1,"Tom"));
        data1.add(new Tuple2<Integer, String>(2,"Mike"));
        data1.add(new Tuple2<Integer, String>(3,"Mary"));
        data1.add(new Tuple2<Integer, String>(4,"Jone"));

        //second table: the city each user lives in (user ID, city)
        ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
        data2.add(new Tuple2<Integer, String>(1,"北京"));
        data2.add(new Tuple2<Integer, String>(2,"上海"));
        data2.add(new Tuple2<Integer, String>(3,"北京"));
        data2.add(new Tuple2<Integer, String>(4,"深圳"));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
        DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);


        //run the join
        //where(0).equalTo(0) means: join on the first column of the first table and the first column of the second table
        //equivalent to: where table1.userID = table2.userID
        table1.join(table2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>,
                Tuple2<Integer, String>,
                Tuple2<String, String>>() {
            public Tuple2<String, String> join(Tuple2<Integer, String> t1, Tuple2<Integer, String> t2) throws Exception {
                return new Tuple2<String, String>(t1.f1,t2.f1);
            }
        }).print();
    }
}

(4) Outer joins and the full outer join

Outer join: the opposite of an inner join; rows that do not satisfy the match condition (the primary key / foreign key match) are returned as well.

public class FlinkDemo4 {
    public static void main(String[] args) throws Exception {
        //first table: users (user ID, name)
        ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
        data1.add(new Tuple2<Integer, String>(1,"Tom"));
        data1.add(new Tuple2<Integer, String>(3,"Mary"));
        data1.add(new Tuple2<Integer, String>(4,"Jone"));

        //second table: the city each user lives in (user ID, city)
        ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
        data2.add(new Tuple2<Integer, String>(1,"北京"));
        data2.add(new Tuple2<Integer, String>(2,"上海"));
        data2.add(new Tuple2<Integer, String>(4,"深圳"));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
        DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);
        
        // left outer join
        System.out.println("Left outer join:");
        table1.leftOuterJoin(table2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
                        return right == null ? new Tuple3<Integer, String, String>(left.f0 , left.f1 , null) :
                                new Tuple3<Integer, String, String>(left.f0,left.f1,right.f1);
                    }
                }).print();

        // right outer join
        System.out.println("Right outer join:");
        table1.rightOuterJoin(table2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
                        return left == null ? new Tuple3<Integer, String, String>(right.f0 , right.f1 , null) :
                                new Tuple3<Integer, String, String>(left.f0,left.f1,right.f1);
                    }
                }).print();

        // full outer join
        System.out.println("Full outer join:");
        table1.fullOuterJoin(table2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
                        if(left == null) {
                            return new Tuple3<Integer, String, String>(right.f0,null,right.f1);
                        }else if(right == null) {
                            return new Tuple3<Integer, String, String>(left.f0,left.f1,null);
                        }else {
                            return new Tuple3<Integer, String, String>(right.f0,left.f1,right.f1);
                        }
                    }
                }).print();

    }
}

(5) Cartesian product

public class FlinkDemo5 {
    public static void main(String[] args) throws Exception {
        //first table: users (user ID, name)
        ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
        data1.add(new Tuple2<Integer, String>(1,"Tom"));
        data1.add(new Tuple2<Integer, String>(2,"Mike"));
        data1.add(new Tuple2<Integer, String>(3,"Mary"));
        data1.add(new Tuple2<Integer, String>(4,"Jone"));

        //second table: the city each user lives in (user ID, city)
        ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
        data2.add(new Tuple2<Integer, String>(1,"北京"));
        data2.add(new Tuple2<Integer, String>(2,"上海"));
        data2.add(new Tuple2<Integer, String>(3,"北京"));
        data2.add(new Tuple2<Integer, String>(4,"深圳"));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
        DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);

        // Cartesian product
        table1.cross(table2).print();


    }
}

(6) First-N and sortPartition (SQL: Top-N)

Note: be careful to import the right classes (Order here is org.apache.flink.api.common.operators.Order).

public class FlinkDemo6 {
    public static void main(String[] args) throws Exception {
        //Tuple3: name, salary, department number
        ArrayList<Tuple3<String,Integer,Integer>> data1 = new ArrayList<Tuple3<String,Integer,Integer>>();

        data1.add(new Tuple3<String,Integer,Integer>("Tom",1000,10));
        data1.add(new Tuple3<String,Integer,Integer>("Mary",2000,20));
        data1.add(new Tuple3<String,Integer,Integer>("Mike",1500,30));
        data1.add(new Tuple3<String,Integer,Integer>("Jone",1800,10));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //build a DataSet
        DataSet<Tuple3<String,Integer,Integer>> table = env.fromCollection(data1);

        //take the first three records
        table.first(3).print();

        System.out.println("********************");
        //sort by department number first, then by salary
        table.sortPartition(2, Order.ASCENDING).sortPartition(1,Order.DESCENDING).print();


    }
}
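
The "(SQL: Top-N)" hint above can be expressed by combining grouping with first-n. A sketch that could be appended to the main method above (my own addition, using the same table DataSet): take the highest-paid employee of each department.

        //group by department number (field 2), sort each group by salary (field 1) descending,
        //and keep the first record of each group, i.e. a per-group Top-1
        table.groupBy(2)
             .sortGroup(1, Order.DESCENDING)
             .first(1)
             .print();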

III. Flink DataStream API ----> streaming / real-time computation

1. DataSource: data sources

(1) A custom data source implements one of these interfaces:

SourceFunction: parallelism of 1
ParallelSourceFunction: multiple parallelism

Create a MySingleDataSourceTest class to test it:

public class MySingleDataSourceTest  {
    public static void main(String[] args) throws Exception {

        // create an execution environment
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        //add the single-parallelism data source
        DataStreamSource<Integer> source = sEnv.addSource( new MySingleDataSource());
        DataStream<Integer> step1 =  source.map(new MapFunction<Integer, Integer>() {

            public Integer map(Integer value) throws Exception {
                System.out.println("收到的数据是:"+ value);
                return value*10;
            }
        });

        //sum every two seconds
        step1.timeWindowAll(Time.seconds(2)).sum(0).setParallelism(1).print();

        sEnv.execute("MySingleDataSourceTest");
    }
}

Create a MySingleDataSource class that implements the SourceFunction interface to provide the custom data source.

public class MySingleDataSource implements SourceFunction<Integer> {

    //counter
    private Integer count = 1;
    //running flag
    private boolean isRunning = true;

    public void run(SourceContext<Integer> ctx) throws Exception {

        // how the data is produced
        while(isRunning) {
            //emit the data
            ctx.collect(count);

            //once per second
            Thread.sleep(1000);
            //increment the counter
            count ++;
        }

    }

    public void cancel() {
        // how to stop producing data
        isRunning = false;
    }
}
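
The notes mention ParallelSourceFunction for multi-parallelism sources but do not show one. A minimal sketch (the class name MyParallelDataSource is hypothetical): every parallel subtask runs its own copy of run(), so with parallelism n you get n independent counters.

import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;

public class MyParallelDataSource implements ParallelSourceFunction<Integer> {

    //counter, one per parallel subtask
    private int count = 1;
    //running flag
    private volatile boolean isRunning = true;

    public void run(SourceContext<Integer> ctx) throws Exception {
        while (isRunning) {
            //emit the data once per second
            ctx.collect(count);
            Thread.sleep(1000);
            count++;
        }
    }

    public void cancel() {
        isRunning = false;
    }
}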

(2) Kafka as a data source

The Java code is as follows:

public class FlinkStreamWithKafka {
    public static void main(String[] args) throws Exception {
        // create a Kafka consumer to use as the stream source
        Properties props = new Properties();
        // broker address
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.111:9093");
        // consumer group
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");

        FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<String>("mydemotopic1", new SimpleStringSchema(), props);

        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source1  = sEnv.addSource(source);
        source1.print();

        sEnv.execute("FlinkStreamWithKafka");
    }
}
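
The example assumes the Kafka connector is on the classpath; for Flink 1.11 that would roughly be the following dependency (version chosen to match the rest of these notes, adjust to your cluster):

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-kafka_2.11</artifactId>
	<version>1.11.0</version>
</dependency>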

2. Transformations

map: takes one element and returns one element; cleaning/transformation work happens in between
flatmap: takes one element and returns zero, one, or many elements
filter: filter function; only the elements that satisfy the predicate are kept
keyBy: partitions the stream by the given key; elements with the same key end up in the same partition
reduce: aggregation; combines the current element with the value returned by the previous reduce call and returns a new value
aggregations: sum(), min(), max(), and so on
window: covered in detail later
union: concatenation of data streams; the element types must match
connect: similar to union, but connects exactly two streams; the two streams may carry different element types and are processed by different functions
coMap / CoFlatMap: the functions used on a ConnectedStreams, analogous to map and flatMap
split: splits one stream into several by tagging each element according to a rule
select: used together with split; picks the tagged sub-streams and combines them into a new stream

(1) union: merge two streams

public class FlinkDemo1 {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        //create two DataStream sources
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());
        DataStreamSource<Integer> source2 = sEnv.addSource(new MySingleDataSource());
        //union: merge the data of both streams into one
        DataStream<Integer> result = source1.union(source2);
        result.print().setParallelism(1);

        sEnv.execute("FlinkDemo1");
    }
}

(2)connect

public class FlinkDemo2 {
    public static void main(String[] args) throws Exception {
        // test with the parallelism-1 source created earlier
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        //create two DataStream sources
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());

        DataStream<String> source2 = sEnv.addSource(new MySingleDataSource())
                .map(new MapFunction<Integer, String>() {
                    public String map(Integer value) throws Exception {
                        // convert the Integer to a String
                        return "String"+value;
                    }
                });


        // the two connected streams may carry different element types
        ConnectedStreams<Integer, String> connect = source1.connect(source2);
        // handle each type separately and return a result for each
        connect.map(new CoMapFunction<Integer, String, Object>() {
            public Object map1(Integer value) throws Exception {
                // handle the first stream
                return "handling the Integer stream: "+ value;
            }

            public Object map2(String value) throws Exception {
                // handle the second stream
                return "handling the String stream: "+ value;
            }
        }).print().setParallelism(1);


        sEnv.execute("FlinkDemo2");

    }
}

(3) split and select

public class FlinkDemo3 {
    public static void main(String[] args) throws Exception {
        // test with the parallelism-1 source created earlier
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());

        //split the source into odd and even numbers by tagging each element
        SplitStream<Integer> split = source1.split(new OutputSelector<Integer>() {
            public Iterable<String> select(Integer value) {
                //a collection holding all the tags of this element; one element may carry several tags
                ArrayList<String> selector = new ArrayList<String>();
                /*
                 * String: the tag attached to the element
                 * Iterable: an element may carry more than one tag
                 */

                if (value % 2 == 0) {
                    selector.add("even"); //even number
                } else {
                    selector.add("odd"); //odd number
                }

                return selector;

            }
        });
        //select all the odd numbers
        split.select("odd").print().setParallelism(1);

        sEnv.execute("FlinkDemo3");


    }
}

(4) Custom partitioning

Flink has its own set of partitioning rules, but sometimes they are not what we want, so we can define our own.

public class MyPartitionerTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());

        SingleOutputStreamOperator<Tuple1<Integer>> data  = source1.map(new MapFunction<Integer, Tuple1<Integer>>() {
            public Tuple1<Integer> map(Integer integer) throws Exception {
                return new Tuple1<Integer>(integer);
            }
        });
        DataStream<Tuple1<Integer>> partitioner = data.partitionCustom(new MyPartitioner(), 0);
        partitioner.map(new MapFunction<Tuple1<Integer>, Integer>() {
            public Integer map(Tuple1<Integer> value) throws Exception {
                //get the data
                Integer data = value.f0;
                long threadID = Thread.currentThread().getId();
                System.out.println("线程号:"+ threadID +"\t 数据:" + data);
                return data;
            }
        }).print().setParallelism(1);


        sEnv.execute("MyPartitionerTest");
    }
}
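
The MyPartitioner class used above is not shown in the original notes. A minimal sketch of what it is assumed to look like (the routing logic is my own example): send even keys to partition 0 and odd keys to partition 1, assuming at least two downstream partitions.

import org.apache.flink.api.common.functions.Partitioner;

//assumed custom partitioner matching partitionCustom(new MyPartitioner(), 0) above
public class MyPartitioner implements Partitioner<Integer> {
    public int partition(Integer key, int numPartitions) {
        //even keys -> partition 0, odd keys -> partition 1 (assumes numPartitions >= 2)
        return key % 2 == 0 ? 0 : 1;
    }
}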

(5)
(6)

3. Data sinks

(1) Saving data to Redis

Read the data from the Kafka source, process it, and write the result to Redis.

public class FlinkStreamWithKafka {
    public static void main(String[] args) throws Exception {
        // create a Kafka consumer to use as the stream source
        Properties props = new Properties();
        // broker address
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.111:9093");
        // consumer group
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");

        FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<String>("mydemotopic1", new SimpleStringSchema(), props);

        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source1  = sEnv.addSource(source);
        SingleOutputStreamOperator<WordCount> result = source1.flatMap(new FlatMapFunction<String, WordCount>() {
            public void flatMap(String s, Collector<WordCount> out) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    out.collect(new WordCount(word, 1));
                }
            }
        }).keyBy("word").sum("count");


        FlinkJedisPoolConfig conf = new  FlinkJedisPoolConfig.Builder()
                .setHost("192.168.92.111").setPort(6379).build();
        RedisSink<WordCount> redisSink = new RedisSink<WordCount>(conf , new MyRedisMapper());

        //write the result to Redis
        result.addSink(redisSink);

        sEnv.execute("FlinkStreamWithKafka");
    }
}

Create a MyRedisMapper class to serve as the Redis mapper.

public class MyRedisMapper implements RedisMapper<WordCount> {
    public RedisCommandDescription getCommandDescription() {
        return new RedisCommandDescription(RedisCommand.HSET,"myflink");
    }

    public String getKeyFromData(WordCount wordCount) {
        return wordCount.word;
    }

    public String getValueFromData(WordCount wordCount) {
        return String.valueOf(wordCount.count);
    }
}
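
The RedisSink, FlinkJedisPoolConfig, and RedisMapper classes come from the Flink Redis connector; the coordinate commonly used with it is given below from memory, so verify it against your repository before relying on it:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-redis_2.11</artifactId>
	<version>1.1.5</version>
</dependency>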

IV. Advanced features

1. Distributed cache: similar to a map-side join, improves performance



The data is cached once on every TaskManager node, so each task does not need to keep its own copy in memory, which improves performance.

Note: the data is cached on the node, and every task running there can then use it.

Java implementation:

public class DistributedCacheDemo {
    public static void main(String[] args) throws Exception {
        // create the entry point of the DataSet API
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        //register the data to cache
        //the path can be on HDFS or on the local file system
        //for HDFS, the HDFS dependencies must be on the classpath
        env.registerCachedFile("D:\\InstallDev\\Java\\MyJavaProject\\flinkDemo\\src\\main\\java\\demo3\\data.txt"
                , "localfile");

        //run a simple computation
        //create a DataSet
        DataSet<Integer> source = env.fromElements(1,2,3,4,5,6,7,8,9,10);

        /*
         * We need RichMapFunction's open() method so the cached file can be read during initialization.
         */
        source.map(new RichMapFunction<Integer, String>() {

            String shareData ;
            /**
             * open() runs only once,
             * so it is the place to do initialization work,
             * such as reading the distributed-cache file (or a broadcast variable).
             */
            @Override
            public void open(Configuration parameters) throws Exception {
                // read the distributed-cache file
                File localfile = this.getRuntimeContext().getDistributedCache().getFile("localfile");

                List<String> lines  = FileUtils.readLines(localfile);

                //get the data
                shareData = lines.get(0);


            }

            public String map(Integer integer) throws Exception {
                return shareData + integer;
            }
        }).print();

        // a batch job that ends with print() does not need an explicit execute()
        // env.execute("DistributedCacheDemo");


    }
}

From the printed output you can see that every task shares the cached data.

2. Setting the parallelism

See the earlier sections.

3. Broadcast variables

Essentially the same idea as the distributed cache.
The difference:

  • distributed cache -------> a file
  • broadcast variable --------> a variable

Java implementation:

public class BroadCastDemo {
    public static void main(String[] args) throws Exception {
        // create the entry point of the DataSet API
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        //the data to broadcast: name and age
        List<Tuple2<String, Integer>> people = new ArrayList<Tuple2<String,Integer>>();
        people.add(new Tuple2<String, Integer>("Tom",23));
        people.add(new Tuple2<String, Integer>("Mike",20));
        people.add(new Tuple2<String, Integer>("Jone",25));

        DataSet<Tuple2<String, Integer>> peopleData = env.fromCollection(people);

        //convert peopleData into HashMaps ----> the data to broadcast
        DataSet<HashMap<String,Integer>> broadCast = peopleData.map(new MapFunction<Tuple2<String,Integer>, HashMap<String,Integer>>() {

            public HashMap<String, Integer> map(Tuple2<String, Integer> value) throws Exception {
                HashMap<String, Integer> result = new HashMap<String, Integer>();
                result.put(value.f0,value.f1);
                return result;
            }
        });


        //look up the age (value) by the name (key)
        DataSet<String> source = env.fromElements("Tom","Mike","Jone");
        DataSet<String> result = source.map(new RichMapFunction<String, String>() {
            //a local variable that holds the broadcast data
            HashMap<String,Integer> allMap = new HashMap<String, Integer>();


            public void open(Configuration parameters) throws Exception {
                //fetch the broadcast variable
                List<HashMap<String,Integer>> broadVariable = getRuntimeContext().getBroadcastVariable("mydata");
                for(HashMap<String,Integer> x:broadVariable) {
                    allMap.putAll(x);
                }
            }


            public String map(String name) throws Exception {
                // look up the age by name
                Integer age = allMap.get(name);
                return "name: " + name + "\t age: " + age;
            }
        }).withBroadcastSet(broadCast, "mydata");


        result.print();

    }
}

4. Accumulators and counters

Purpose: maintain a single piece of data globally.
Note: the accumulator value only becomes available after the job has finished.

Java implementation:

public class AccumulatorDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // create the DataSet
        DataSet<String> data = env.fromElements("Tom", "Mike", "Mary", "Jone");

        MapOperator<String, Integer> result = data.map(new RichMapFunction<String, Integer>() {

            //step 1: create an accumulator
            private IntCounter intCounter = new IntCounter();

            @Override
            public void open(Configuration parameters) throws Exception {

                //step 2: register the accumulator
                this.getRuntimeContext().addAccumulator("myIntCounter", intCounter);
            }

            public Integer map(String s) throws Exception {
                //step 3: count
                intCounter.add(1);
                return 0;
            }
        }).setParallelism(4);

        // a batch job needs a sink, otherwise execute() fails
        result.writeAsText("D:\\InstallDev\\Java\\MyJavaProject\\flinkDemo\\src\\main\\java\\data");

        JobExecutionResult finalResult  = env.execute("AccumulatorDemo");
        //step 4: read the accumulator value
        Object total  = finalResult.getAccumulatorResult("myIntCounter");
        System.out.println("final accumulator value: "+total);
    }
}

5. State management

Three kinds of state persistence are supported: memory, file system, and RocksDB.

(1) What is state?

Stateful computation means that while the program runs, Flink keeps intermediate results inside the program and makes them available to the functions or operators that come later. The state can live in local storage, i.e. Flink's heap or off-heap memory, or in a third-party store such as the RocksDB backend that ships with Flink; users can also plug in their own storage to support more complex logic.
Stateless computation, by contrast, stores nothing between records and never feeds a result into the next step: each record is processed on its own, the result is emitted, and the next record is handled independently.

(2) Checkpoints

Put simply, a checkpoint is a timer: we configure how often it fires, and every time it does, the state is persisted to the location we specified.
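
A minimal sketch of how checkpointing is typically switched on and tuned (the interval, timeout, and pause values here are arbitrary example numbers, not taken from the original notes):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //take a checkpoint every 5 seconds
        env.enableCheckpointing(5000);
        //exactly-once is the default mode, shown here explicitly
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        //wait at least 500 ms between the end of one checkpoint and the start of the next
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
        //discard a checkpoint that takes longer than 60 seconds
        env.getCheckpointConfig().setCheckpointTimeout(60000);
    }
}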

(3) Checkpoint state backends

Three kinds of state persistence are supported: memory, file system, and RocksDB.
To use HDFS as the backend storage, add the dependency:

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-client</artifactId>
	<version>3.1.2</version>
	<scope>provided</scope>
</dependency>

(4) Restart strategies

Flink supports several restart strategies to control how a job is restarted after a failure. The common ones are:

  • Fixed delay
  • Failure rate
  • No restart

If checkpointing is not enabled, the no-restart strategy is used.

  • Option 1: global configuration in flink-conf.yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
  • Option 2: set it in the application code
sEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(
		  3, // number of restart attempts
		  Time.of(10, TimeUnit.SECONDS) // delay between attempts
		));

State management: a complete example

Java implementation

//computation: sum every three numbers
public class CountWindowWithState {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
		
		//enable checkpointing
		sEnv.enableCheckpointing(1000);//one checkpoint per second
		// checkpoint state backend
		sEnv.setStateBackend(new FsStateBackend("hdfs://bigdata111:9000/flink/ckpt"));
		// restart strategy
		sEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(
				  3, // number of restart attempts
				  Time.of(10, TimeUnit.SECONDS) // delay between attempts
				));
		
		//create a keyed stream
		sEnv.fromElements(
				Tuple2.of(1, 1),
				Tuple2.of(1, 2),
				Tuple2.of(1, 3),
				Tuple2.of(1, 4),
				Tuple2.of(1, 5),
				Tuple2.of(1, 6),
				Tuple2.of(1, 7),
				Tuple2.of(1, 8),
				Tuple2.of(1, 9))
		.keyBy(0)
		.flatMap(new MyFlatMapFunction())
		.print().setParallelism(1);
		
		sEnv.execute("CountWindowWithState");
	}
}

class MyFlatMapFunction extends RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>{
	//declare a state handle
	//the first Integer is the element count, the second is the running sum
	private ValueState<Tuple2<Integer,Integer>> state;
	
	@Override
	public void open(Configuration parameters) throws Exception {
		//initialize the state
		ValueStateDescriptor<Tuple2<Integer,Integer>> descriptor = 
				new ValueStateDescriptor<Tuple2<Integer,Integer>>("mystate",    //name of the state
						                                          TypeInformation.of(new TypeHint<Tuple2<Integer,Integer>>() {
																  }),     //type of the state
						                                          Tuple2.of(0, 0));
		state = getRuntimeContext().getState(descriptor);
	}

	@Override
	public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
		// sum every three elements
		//read the current state
		Tuple2<Integer,Integer> current = state.value();
		
		//increment the element count
		current.f0 += 1;
		//add the value to the running sum
		current.f1 += value.f1;
		
		//update the state
		state.update(current);
		
		//check whether we have already seen three elements
		if(current.f0 >= 3) {
			//emit the result: (count, sum)
			out.collect(new Tuple2<Integer, Integer>(current.f0,current.f1));
			//clear the state
			state.clear();
		}
	}
}

An operator may produce an intermediate result that needs to be kept and managed; that is exactly what state is for.

6. Window computation and watermarks (out-of-order data)

Window computation was covered earlier and is not repeated here.
To understand watermarks, first understand three notions of time:

  • Event time: the time at which the source produced the record
  • Ingestion time: the time at which the record arrived in Flink
  • Processing time: the wall-clock time of the machine at the moment an operator processes the record

When processing out-of-order data you must choose a time characteristic ----> usually: event time.

The way I think of a watermark is as a delayed-trigger mechanism that waits for data. The source produces records in order, but by the time they reach Flink the network may have reordered them, while Flink still wants to process them in order; so we configure a waiting time, an estimate of how late the latest-arriving record can be. How, then, is the computation over a window actually triggered? See the figure below.

(Figure: how the watermark triggers the window computation)

A watermark is like the water level in a container: as water is poured in, the level slowly rises. In Flink, as records flow through, the watermark rises with them; it is the measure of how far event time has progressed at any moment. With the bounded-out-of-orderness strategy used in the example below, the watermark is roughly the largest event time seen so far minus the configured delay, and a window fires once the watermark passes the window's end time.

7. allowedLateness

By default, data that arrives later than the watermark allows is dropped. With allowedLateness we can handle such late data separately.
Core code:

//define the tag for the side output
OutputTag<StationLog> lateTag = new OutputTag<StationLog>("late-Data") {};
.allowedLateness(Time.seconds(30)).sideOutputLateData(lateTag)
result.getSideOutput(lateTag).print();

Window computation: a complete example

Java implementation

package day1113.datastream;

import java.time.Duration;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Every 3 seconds, for each base station, output the call record with the longest duration over the past 5 seconds.
public class WaterMarkDemo {
	public static void main(String[] args) throws Exception {
		//get the streaming execution environment
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		//set the time characteristic
		env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
		
		env.setParallelism(1);
		//set how often watermarks are generated periodically; on a busy stream, emitting a watermark per event would hurt performance
		env.getConfig().setAutoWatermarkInterval(100);//the default is 100 ms
		//define the tag for the side output
		OutputTag<StationLog> lateTag = new OutputTag<StationLog>("late-Data") {};
		
		//get the input stream
		DataStreamSource<String> stream = env.socketTextStream("bigdata111", 1234);

		SingleOutputStreamOperator<String> result = stream.flatMap(new FlatMapFunction<String, StationLog>() {

			public void flatMap(String data, Collector<StationLog> output) throws Exception {
				String[] words = data.split(",");
				//fields: station ID, from, to, duration, callTime
				output.collect(new StationLog(words[0], words[1],words[2], Long.parseLong(words[3]), Long.parseLong(words[4])));
			}
		}).filter(new FilterFunction<StationLog>() {
			
			@Override
			public boolean filter(StationLog value) throws Exception {
				return value.getDuration() > 0?true:false;
			}
		}).assignTimestampsAndWatermarks(WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(3)) //allow events to arrive up to 3 seconds late
				.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
					@Override
					public long extractTimestamp(StationLog element, long recordTimestamp) {
						return element.getCallTime(); //use the call time as the event-time field
					}
				})
		).keyBy(new KeySelector<StationLog, String>(){
			@Override
			public String getKey(StationLog value) throws Exception {
				return value.getStationID();  //group by base station
			}}
		).timeWindow(Time.seconds(5),Time.seconds(3)) //sliding window: size 5 s, slide 3 s
		// route data that is too late even for the watermark into a side output
		.allowedLateness(Time.seconds(30)).sideOutputLateData(lateTag)
		.reduce(new MyReduceFunction(),new MyProcessWindows());
		//fetch the side output
		//data the watermark could not handle is processed separately -----> here it is simply printed
		result.getSideOutput(lateTag).print();
		result.print();

		env.execute();
	}
}
//how the data inside a window is combined: keep the call record with the longest duration.
class MyReduceFunction implements ReduceFunction<StationLog> {
	@Override
	public StationLog reduce(StationLog value1, StationLog value2) throws Exception {
		// keep the record with the longer call duration
		return value1.getDuration() >= value2.getDuration() ? value1 : value2;
	}
}
//what to emit once a window has been evaluated
class MyProcessWindows extends ProcessWindowFunction<StationLog, String, String, TimeWindow> {
	@Override
	public void process(String key, ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
			Iterable<StationLog> elements, Collector<String> out) throws Exception {
		StationLog maxLog = elements.iterator().next();

		StringBuffer sb = new StringBuffer();
		sb.append("window range: ").append(context.window().getStart()).append("----").append(context.window().getEnd()).append("\n");
		sb.append("station ID: ").append(maxLog.getStationID()).append("\t")
		  .append("call time: ").append(maxLog.getCallTime()).append("\t")
		  .append("caller: ").append(maxLog.getFrom()).append("\t")
		  .append("callee: ").append(maxLog.getTo()).append("\t")
		  .append("duration: ").append(maxLog.getDuration()).append("\n");
		out.collect(sb.toString());
	}
}

The StationLog class

//station1,18688822219,18684812319,10,1595158485855
public class StationLog {
    private String stationID;   //station ID
    private String from;		//caller
    private String to;			//callee
    private long duration;		//call duration
    private long callTime;		//time the call was made
    public StationLog(String stationID, String from,
                      String to, long duration,
                      long callTime) {
        this.stationID = stationID;
        this.from = from;
        this.to = to;
        this.duration = duration;
        this.callTime = callTime;
    }
    public String getStationID() {
        return stationID;
    }
    public void setStationID(String stationID) {
        this.stationID = stationID;
    }
    public long getCallTime() {
        return callTime;
    }
    public void setCallTime(long callTime) {
        this.callTime = callTime;
    }
    public String getFrom() {
        return from;
    }
    public void setFrom(String from) {
        this.from = from;
    }

    public String getTo() {
        return to;
    }
    public void setTo(String to) {
        this.to = to;
    }
    public long getDuration() {
        return duration;
    }
    public void setDuration(long duration) {
        this.duration = duration;
    }
}

Sample data

station1,18688822219,18684812319,10,1595158485855
station5,13488822219,13488822219,50,1595158490856
station5,13488822219,13488822219,50,1595158495856
station5,13488822219,13488822219,50,1595158500856
station5,13488822219,13488822219,50,1595158505856
station2,18464812121,18684812319,20,1595158507856
station3,18468481231,18464812121,30,1595158510856
station5,13488822219,13488822219,50,1595158515857
station2,18464812121,18684812319,20,1595158517857
station4,18684812319,18468481231,40,1595158521857
station0,18684812319,18688822219,0,1595158521857
station2,18464812121,18684812319,20,1595158523858
station6,18608881319,18608881319,60,1595158529858
station3,18468481231,18464812121,30,1595158532859
station4,18684812319,18468481231,40,1595158536859
station2,18464812121,18684812319,20,1595158538859
station1,18688822219,18684812319,10,1595158539859
station5,13488822219,13488822219,50,1595158544859
station4,18684812319,18468481231,40,1595158548859
station3,18468481231,18464812121,30,1595158551859
station1,18688822219,18684812319,10,1595158552859
station3,18468481231,18464812121,30,1595158555859
station0,18684812319,18688822219,0,1595158555859
station2,18464812121,18684812319,20,1595158557859
station4,18684812319,18468481231,40,1595158561859

V. Data analysis engine: Flink Table & SQL (not a focus here, just a few examples)

Flink Table & SQL was not yet mature and was still under active development at the time of writing (2020-11-20):
Please note that the Table API and SQL are not yet feature complete and are being actively developed. Not all operations are supported by every combination of [Table API, SQL] and [stream, batch] input.

Add the dependencies:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-api-java-bridge_2.11</artifactId>
	<version>1.11.0</version>
	<scope>provided</scope>
</dependency>

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-planner_2.11</artifactId>
	<version>1.11.0</version>
	<scope>provided</scope>
</dependency>

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-planner-blink_2.11</artifactId>
	<version>1.11.0</version>
	<scope>provided</scope>
</dependency>

Java demo

(1) Batch processing

The WordCountBatchTableAPI class

public class WordCountBatchTableAPI {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //create a DataSet representing the data to process
        DataSet<String> source = env.fromElements("I love Beijing","I love China",
                "Beijing is the capital of China");
        DataSet<WordCount> input  = source.flatMap(new FlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String s, Collector<WordCount> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new WordCount(word, 1));
                }
            }
        });

        // create the Table environment
        BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
        // convert the DataSet into a Table
        Table table = tEnv.fromDataSet(input);
        // process the data
        Table data  = table.groupBy("word").select("word,frequency.sum as frequency");
        // convert back into a DataSet
        DataSet<WordCount> result  = tEnv.toDataSet(data, WordCount.class);
        result.print();

    }

}

class WordCount {
    public String word;
    public long frequency;

    public WordCount() {
    }

    public WordCount(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }

    @Override
    public String toString() {
        return "WordCount [word=" + word + ", frequency=" + frequency + "]";
    }
}

The WordCountBatchSQL class

public class WordCountBatchSQL {

	public static void main(String[] args) throws Exception {
		// create the entry point of the DataSet API
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		
		//create a DataSet representing the data to process
		DataSet<String> source = env.fromElements("I love Beijing","I love China",
				                                  "Beijing is the capital of China");
		
		//turn each word into a WordCount
		DataSet<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {

			@Override
			public void flatMap(String value, Collector<WordCount> out) throws Exception {
				// I love Beijing
				String[] words = value.split(" ");
				for(String w:words) {
					//                       k2 v2
					out.collect(new WordCount(w,1));
				}
			}
		});
		
		//create a Table execution environment
		BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
		
		//register the table
		tEnv.registerDataSet("WordCount", input,"word,frequency");
		
		//run the SQL
		Table table = tEnv.sqlQuery("select word,sum(frequency) as frequency from WordCount group by word");
		
		DataSet<WordCount> result = tEnv.toDataSet(table, WordCount.class);
		result.print();
	}

	//class describing the table schema
	public static class WordCount{
		public String word;
		public long frequency;
		
		public WordCount() {}
		public WordCount(String word,int frequency) {
			this.word = word;
			this.frequency = frequency;
		}
		@Override
		public String toString() {
			return "WordCount [word=" + word + ", frequency=" + frequency + "]";
		}
	}
}

(2) Streaming

The WordCountStreamTableAPI class

public class WordCountStreamTableAPI {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
	
		//receive the input
		DataStreamSource<String> source = sEnv.socketTextStream("bigdata111", 1234);

		DataStream<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {

			@Override
			public void flatMap(String value, Collector<WordCount> out) throws Exception {
				// data: I love Beijing
				String[] words = value.split(" ");
				for(String word:words) {
					out.collect(new WordCount(word,1));
				}
				
			}
		});
		
		//create a Table environment for streams
		StreamTableEnvironment stEnv = StreamTableEnvironment.create(sEnv);
		
		Table table = stEnv.fromDataStream(input,"word,frequency");
		Table result = table.groupBy("word").select("word,frequency.sum").as("word", "frequency");
		
		//output
		stEnv.toRetractStream(result, WordCount.class).print();
		
		sEnv.execute("WordCountStreamTableAPI");
	}

	//class describing the table schema
	public static class WordCount{
		public String word;
		public long frequency;
		
		public WordCount() {}
		public WordCount(String word,int frequency) {
			this.word = word;
			this.frequency = frequency;
		}
		@Override
		public String toString() {
			return "WordCount [word=" + word + ", frequency=" + frequency + "]";
		}
	}
}

The WordCountStreamSQL class

public class WordCountStreamSQL {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
		
		//receive the input
		DataStreamSource<String> source = sEnv.socketTextStream("bigdata111", 1234);

		DataStream<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {

			@Override
			public void flatMap(String value, Collector<WordCount> out) throws Exception {
				// data: I love Beijing
				String[] words = value.split(" ");
				for(String word:words) {
					out.collect(new WordCount(word,1));
				}
				
			}
		});
		
		//create a Table environment for streams
		StreamTableEnvironment stEnv = StreamTableEnvironment.create(sEnv);
		Table table = stEnv.fromDataStream(input,"word,frequency");
		
		//run the SQL
		Table result = stEnv.sqlQuery("select word,sum(frequency) as frequency from " + table + " group by word");
		stEnv.toRetractStream(result, WordCount.class).print();
		
		sEnv.execute("WordCountStreamSQL");
	}
	//class describing the table schema
	public static class WordCount{
		public String word;
		public long frequency;
		
		public WordCount() {}
		public WordCount(String word,int frequency) {
			this.word = word;
			this.frequency = frequency;
		}
		@Override
		public String toString() {
			return "WordCount [word=" + word + ", frequency=" + frequency + "]";
		}
	}
}