I. Flink Basics
1. What is Flink? Data model, architecture, and ecosystem
Official definition:
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Flink processes two kinds of data sets:
(1) Unbounded data streams —> stream / real-time computation —> Flink DataStream API
(2) Bounded data streams —> offline / batch computation —> Flink DataSet API
Flink's architecture:
Flink's architecture is very similar to that of Spark and Storm, which we covered earlier: it follows a master/slave design. The master node is the JobManager and the worker nodes are TaskManagers. Each worker runs a process, and the tasks inside that process are the smallest units of execution.
Flink's ecosystem:
As the ecosystem diagram shows, Flink covers both offline (batch) and real-time (stream) processing.
2. Deploying Flink
(1) Standalone mode
tar -zxvf flink-1.11.0-bin-scala_2.12.tgz -C ~/training/
Core configuration file: conf/flink-conf.yaml
Web UI: port 8081
- Pseudo-distributed environment: Flink works out of the box with its default configuration; just start it with bin/start-cluster.sh
- Fully distributed environment: see the configuration sketch below, then start the cluster the same way with bin/start-cluster.sh
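A minimal sketch of the entries typically adjusted in conf/flink-conf.yaml (the host name and sizes below are assumptions, not values taken from these notes):
# address of the JobManager that all TaskManagers connect to
jobmanager.rpc.address: bigdata111
# total process memory of the JobManager and of each TaskManager
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
# number of task slots (parallel pipelines) per TaskManager
taskmanager.numberOfTaskSlots: 2
# default parallelism for jobs that do not set their own
parallelism.default: 1
For a fully distributed cluster, the JobManager host additionally goes into conf/masters and each worker host on its own line into the workers (slaves) file; bin/start-cluster.sh then starts everything over SSH.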
(2) Flink on YARN
Flink jobs are submitted to and executed on YARN. There are two modes; we normally use the second one.
Note: to use Hadoop, the flink-shaded-hadoop-2-uber-2.8.3-10.0.jar must be added to the lib directory, because the Hadoop dependencies were removed from the Flink distribution starting with version 1.10.
(Mode 1) Session mode: centrally managed resources
YARN starts one Flink cluster up front with a fixed amount of resources; all submitted jobs run inside this yarn-session, so no matter how many jobs there are, they share the resources requested from YARN.
bin/yarn-session.sh -n 2 -jm 1024 -tm 1024 -d
(Mode 2) Per-job mode: resources managed per job
Every job submitted to YARN creates a brand-new Flink cluster. Jobs are independent of each other, which makes them easy to manage, and the cluster is torn down once the job finishes.
bin/flink run -m yarn-cluster -p 1 -yjm 1024 -ytm 1024 examples/batch/WordCount.jar
Note: -p 1 sets the job's parallelism.
(3) HA mode: based on ZooKeeper
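The notes give no details for this mode; a sketch of the ZooKeeper-based HA settings in conf/flink-conf.yaml would roughly be (quorum hosts and paths are assumptions):
high-availability: zookeeper
# ZooKeeper quorum used for leader election and for pointers to job metadata
high-availability.zookeeper.quorum: bigdata111:2181,bigdata112:2181,bigdata113:2181
# durable storage (e.g. HDFS) for the JobManager metadata itself
high-availability.storageDir: hdfs://bigdata111:9000/flink/ha
Several JobManager hosts are then listed in conf/masters, and a standby takes over when the active JobManager fails.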
3. Running Flink jobs: WordCount
Flink ships with a number of examples under the examples directory; they are worth a look.
(1) Offline computation: batch processing
bin/flink run examples/batch/WordCount.jar -input hdfs://bigdata111:9000/input/data.txt -output hdfs://bigdata111:9000/flink/wc
(2) Stream computation: real-time processing
bin/flink run examples/streaming/SocketWindowWordCount.jar --port 1234
4. Writing your own Flink program: WordCount (Java)
Add the dependencies:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.11.0</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.11.0</version>
</dependency>
(1) Offline computation: batch processing
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class WordCountBatchExample {
public static void main(String[] args) throws Exception{
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
//create a DataSet that represents the data to process
DataSource<String> data = env.fromElements("i love beijing",
"i love china", "beijing is the capital of china");
data.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
String[] words = s.split(" ");
for (String word : words) {
collector.collect(new Tuple2<String, Integer>(word,1));
}
}
}).groupBy(0).sum(1).print();
}
}
(2) Stream computation: real-time processing
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class WordCountStreamExample {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment sen = StreamExecutionEnvironment.getExecutionEnvironment();
//receive input from the socket
DataStreamSource<String> source = sen.socketTextStream("bigdata111", 1234);
source.flatMap(new FlatMapFunction<String, WordCount>() {
public void flatMap(String s, Collector<WordCount> collector) throws Exception {
String[] words = s.split(" ");
for (String word : words) {
collector.collect(new WordCount(word,1));
}
}
}).keyBy("word").sum("count").print().setParallelism(1);
sen.execute("WordCountStreamExample");
}
}
Start the data source on the Linux host bigdata111 first: nc -l 1234. (WordCount here is a POJO with public word and count fields and a no-argument constructor, which is what keyBy("word") and sum("count") rely on to access the fields by name.)
5. Comparison: technical characteristics of Storm, Spark Streaming, and Flink
II. Flink DataSet API ----> offline computation (batch processing)
Reading from and writing to MySQL
Add the dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-jdbc_2.11</artifactId>
<version>1.9.1</version>
</dependency>
Java implementation:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//read from MySQL
DataSource<Row> dataSource = env.createInput(JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.jdbc.Driver")
.setDBUrl("jdbc:mysql://localhost:3306")
.setUsername("root")
.setPassword("root")
.setQuery("select id,dq from flink.ajxx_xs ")
.setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO))
.finish());
final BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);
tableEnv.registerDataSet("ods_tb01", dataSource);
Table query = tableEnv.sqlQuery("select * from ods_tb01");
DataSet<Row> result = tableEnv.toDataSet(query, Row.class);
result.print();
//write the result back to MySQL
result.output(JDBCOutputFormat.buildJDBCOutputFormat()
.setDrivername("com.mysql.jdbc.Driver")
.setDBUrl("jdbc:mysql://localhost:3306")
.setUsername("root")
.setPassword("root")
.setQuery("insert into flink.ajxx_xs2 (id,dq) values (?,?)")
.setSqlTypes(new int[]{Types.INTEGER, Types.NCHAR})
.finish());
env.execute("flink-test");
Operator overview
| Operator | Description |
| --- | --- |
| map | Takes one element and returns one element; useful for cleaning and transforming data |
| flatMap | Takes one element and can return zero, one, or many elements |
| map vs. flatMap | map produces exactly one output per input; flatMap can produce any number of outputs per input |
| mapPartition | Like map, but processes one whole partition of data at a time |
| filter | Evaluates a predicate on each element and keeps the elements that match |
| reduce | Aggregates the data: combines the current element with the previous reduce result and returns a new value |
| aggregate | Built-in aggregations such as sum, max, and min |
| distinct | Returns the de-duplicated elements |
| join | Inner join |
| outerJoin | Outer join |
| cross | Cartesian product of two data sets |
| union | Union of two data sets; the element types must match |
| first-n | Returns the first N elements of a data set |
| sortPartition | Sorts all partitions of the data set locally; chain sortPartition() calls to sort on multiple fields |
(1) map, flatMap, and mapPartition: note the differences
public class FlinkDemo1 {
public static void main(String[] args) throws Exception {
// prepare the data
ArrayList<String> data = new ArrayList<String>();
data.add("I love Beijing");
data.add("I love China");
data.add("Beijing is the capital of China");
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSource<String> dataSource = env.fromCollection(data);
/*
 * map: takes one element and returns one element; useful for cleaning and transforming data
 */
MapOperator<String, List<String>> map = dataSource.map(new MapFunction<String, List<String>>() {
public List<String> map(String value) throws Exception {
String[] words = value.split(" ");
//collect the words into a List
List<String> list = new ArrayList<String>();
for (String w : words) {
list.add(w);
}
list.add("-----------");
return list;
}
});
map.print();
System.out.println("**************************");
/*
 * flatMap: takes one element and can return zero, one, or many elements
 */
dataSource.flatMap(new FlatMapFunction<String, String>() {
public void flatMap(String value, Collector<String> out) throws Exception {
String[] words = value.split(" ");
for(String w:words) {
out.collect(w);
}
}
}).print();
System.out.println("************************");
/*
 * mapPartition
 * receives all elements of one partition at a time; often used when something expensive, such as a database connection, should be set up once per partition
 */
dataSource.mapPartition(new MapPartitionFunction<String, String>() {
public void mapPartition(Iterable<String> iterable, Collector<String> out) throws Exception {
Iterator<String> ite = iterable.iterator();
while (ite.hasNext()){
String next = ite.next();
String[] s = next.split(" ");
for (String s1 : s) {
out.collect(s1);
}
}
}
}).print();
}
}
(2) filter and distinct
public class FlinkDemo2 {
public static void main(String[] args) throws Exception {
// prepare the data
ArrayList<String> data = new ArrayList<String>();
data.add("I love Beijing");
data.add("I love China");
data.add("Beijing is the capital of China");
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSource<String> dataSource = env.fromCollection(data);
final FlatMapOperator<String, String> word = dataSource.flatMap(new FlatMapFunction<String, String>() {
public void flatMap(String s, Collector<String> collector) throws Exception {
String[] words = s.split(" ");
for (String word : words) {
collector.collect(word);
}
}
});
// keep only words whose length is at least 3
FilterOperator<String> filterword = word.filter(new FilterFunction<String>() {
public boolean filter(String s) throws Exception {
return s.length() >= 3;
}
});
filterword.distinct().print();
}
}
(3) Join: inner join
Inner join: joins two tables and returns only the rows that satisfy the match condition (primary key to foreign key).
public class FlinkDemo3 {
public static void main(String[] args) throws Exception {
//first table: users (user ID, name)
ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
data1.add(new Tuple2<Integer, String>(1,"Tom"));
data1.add(new Tuple2<Integer, String>(2,"Mike"));
data1.add(new Tuple2<Integer, String>(3,"Mary"));
data1.add(new Tuple2<Integer, String>(4,"Jone"));
//second table: user cities (user ID, city)
ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
data2.add(new Tuple2<Integer, String>(1,"北京"));
data2.add(new Tuple2<Integer, String>(2,"上海"));
data2.add(new Tuple2<Integer, String>(3,"北京"));
data2.add(new Tuple2<Integer, String>(4,"深圳"));
// create the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);
//perform the join
//where(0).equalTo(0) means: join the first column of table 1 with the first column of table 2
//equivalent to: where table1.userID = table2.userID
table1.join(table2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>,
Tuple2<Integer, String>,
Tuple2<String, String>>() {
public Tuple2<String, String> join(Tuple2<Integer, String> t1, Tuple2<Integer, String> t2) throws Exception {
return new Tuple2<String, String>(t1.f1,t2.f1);
}
}).print();
}
}
(4) Outer joins and full outer join
Outer join: the opposite of an inner join; rows that do not find a match (primary key to foreign key) are returned as well.
public class FlinkDemo4 {
public static void main(String[] args) throws Exception {
//first table: users (user ID, name)
ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
data1.add(new Tuple2<Integer, String>(1,"Tom"));
data1.add(new Tuple2<Integer, String>(3,"Mary"));
data1.add(new Tuple2<Integer, String>(4,"Jone"));
//second table: user cities (user ID, city)
ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
data2.add(new Tuple2<Integer, String>(1,"北京"));
data2.add(new Tuple2<Integer, String>(2,"上海"));
data2.add(new Tuple2<Integer, String>(4,"深圳"));
// create the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);
// left outer join
System.out.println("Left outer join:");
table1.leftOuterJoin(table2).where(0).equalTo(0)
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
return right == null ? new Tuple3<Integer, String, String>(left.f0 , left.f1 , null) :
new Tuple3<Integer, String, String>(left.f0,left.f1,right.f1);
}
}).print();
// right outer join
System.out.println("Right outer join:");
table1.rightOuterJoin(table2).where(0).equalTo(0)
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
return left == null ? new Tuple3<Integer, String, String>(right.f0, null, right.f1) :
new Tuple3<Integer, String, String>(left.f0,left.f1,right.f1);
}
}).print();
// full outer join
System.out.println("Full outer join:");
table1.fullOuterJoin(table2).where(0).equalTo(0)
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
if(left == null) {
return new Tuple3<Integer, String, String>(right.f0,null,right.f1);
}else if(right == null) {
return new Tuple3<Integer, String, String>(left.f0,left.f1,null);
}else {
return new Tuple3<Integer, String, String>(right.f0,left.f1,right.f1);
}
}
}).print();
}
}
(5) Cartesian product (cross)
public class FlinkDemo5 {
public static void main(String[] args) throws Exception {
//first table: users (user ID, name)
ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
data1.add(new Tuple2<Integer, String>(1,"Tom"));
data1.add(new Tuple2<Integer, String>(2,"Mike"));
data1.add(new Tuple2<Integer, String>(3,"Mary"));
data1.add(new Tuple2<Integer, String>(4,"Jone"));
//second table: user cities (user ID, city)
ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
data2.add(new Tuple2<Integer, String>(1,"北京"));
data2.add(new Tuple2<Integer, String>(2,"上海"));
data2.add(new Tuple2<Integer, String>(3,"北京"));
data2.add(new Tuple2<Integer, String>(4,"深圳"));
// create the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);
// Cartesian product
table1.cross(table2).print();
}
}
(6) first-n and sortPartition (the SQL equivalent: Top-N)
Note: be careful to import the right classes (Order here is org.apache.flink.api.common.operators.Order).
public class FlinkDemo6 {
public static void main(String[] args) throws Exception {
//Tuple3: name, salary, department number
ArrayList<Tuple3<String,Integer,Integer>> data1 = new ArrayList<Tuple3<String,Integer,Integer>>();
data1.add(new Tuple3<String,Integer,Integer>("Tom",1000,10));
data1.add(new Tuple3<String,Integer,Integer>("Mary",2000,20));
data1.add(new Tuple3<String,Integer,Integer>("Mike",1500,30));
data1.add(new Tuple3<String,Integer,Integer>("Jone",1800,10));
// create the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//build a DataSet
DataSet<Tuple3<String,Integer,Integer>> table = env.fromCollection(data1);
//take the first three records
table.first(3).print();
System.out.println("********************");
//sort by department number (ascending), then by salary (descending)
table.sortPartition(2, Order.ASCENDING).sortPartition(1,Order.DESCENDING).print();
}
}
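The operator table above also lists reduce and union, which the demos do not cover; below is a minimal sketch in the same style (the class name and values are my own, and the imports follow the pattern of the earlier demos: ExecutionEnvironment, DataSet, ReduceFunction):
public class FlinkDemo7 {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //two small DataSets of the same element type, as union requires
        DataSet<Integer> set1 = env.fromElements(1, 2, 3);
        DataSet<Integer> set2 = env.fromElements(4, 5, 6);
        //union: concatenate both data sets
        DataSet<Integer> all = set1.union(set2);
        //reduce: combine the current element with the previous result; here a running sum
        all.reduce(new ReduceFunction<Integer>() {
            public Integer reduce(Integer a, Integer b) throws Exception {
                return a + b;
            }
        }).print(); //prints 21
    }
}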
III. Flink DataStream API ----> stream computation (real-time)
1. DataSource: data sources
(1) Custom data sources, built by implementing one of these interfaces:
SourceFunction: parallelism of 1
ParallelSourceFunction: multiple parallelism
Create a MySingleDataSourceTest class to test it:
public class MySingleDataSourceTest {
public static void main(String[] args) throws Exception {
// create an execution environment
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
//add the single-parallelism source
DataStreamSource<Integer> source = sEnv.addSource( new MySingleDataSource());
DataStream<Integer> step1 = source.map(new MapFunction<Integer, Integer>() {
public Integer map(Integer value) throws Exception {
System.out.println("收到的数据是:"+ value);
return value*10;
}
});
//sum the values every two seconds
step1.timeWindowAll(Time.seconds(2)).sum(0).setParallelism(1).print();
sEnv.execute("MySingleDataSourceTest");
}
}
Create the MySingleDataSource class, implementing the SourceFunction interface, as the custom data source.
public class MySingleDataSource implements SourceFunction<Integer> {
//counter
private Integer count = 1;
//on/off switch
private boolean isRunning = true;
public void run(SourceContext<Integer> ctx) throws Exception {
// how the data is produced
while(isRunning) {
//emit the value
ctx.collect(count);
//wait one second
Thread.sleep(1000);
//increment the counter
count ++;
}
}
public void cancel() {
// how to stop producing data
isRunning = false;
}
}
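For the multi-parallelism case, here is a sketch of a source implementing ParallelSourceFunction (not part of the original notes; the class name is my own, and the imports mirror MySingleDataSource). Each parallel subtask runs its own copy of run():
public class MyParallelDataSource implements ParallelSourceFunction<Integer> {
    private Integer count = 1;
    private boolean isRunning = true;
    public void run(SourceContext<Integer> ctx) throws Exception {
        //every parallel instance emits its own sequence
        while (isRunning) {
            ctx.collect(count);
            Thread.sleep(1000);
            count++;
        }
    }
    public void cancel() {
        isRunning = false;
    }
}
Registering it with sEnv.addSource(new MyParallelDataSource()).setParallelism(2) would run two instances side by side, so each number appears twice in the output.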
(2) Kafka as a data source
Java code:
public class FlinkStreamWithKafka {
public static void main(String[] args) throws Exception {
// create a Kafka consumer as the source
Properties props = new Properties();
// broker address
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.111:9093");
// consumer group
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");
FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<String>("mydemotopic1", new SimpleStringSchema(), props);
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> source1 = sEnv.addSource(source);
source1.print();
sEnv.execute("FlinkStreamWithKafka");
}
}
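The notes do not list the Kafka connector dependency; for Flink 1.11 it would presumably be along these lines:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.11.0</version>
</dependency>
FlinkKafkaConsumer comes from org.apache.flink.streaming.connectors.kafka, and SimpleStringSchema from org.apache.flink.api.common.serialization.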
2. Transformations
| Operator | Description |
| --- | --- |
| map | Takes one element and returns one element; useful for cleaning and transforming data |
| flatMap | Takes one element and can return zero, one, or many elements |
| filter | Evaluates a predicate on each element and keeps the elements that match |
| keyBy | Partitions the stream by key; elements with the same key go to the same partition |
| reduce | Aggregates the data: combines the current element with the previous reduce result and returns a new value |
| aggregations | sum(), min(), max(), and so on |
| window | Covered in detail later |
| union | Union of two streams; the element types must match |
| connect | Similar to union, but connects exactly two streams; their element types may differ, and each stream gets its own processing function |
| CoMap, CoFlatMap | Used on ConnectedStreams; the connected-stream counterparts of map and flatMap |
| split | Splits one stream into several by tagging each element according to a rule |
| select | Used together with split: selects the tagged sub-streams and combines them into a new stream |
(1) union: merging two streams
public class FlinkDemo1 {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
//create two DataStream sources
DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());
DataStreamSource<Integer> source2 = sEnv.addSource(new MySingleDataSource());
//union: merge the data of the two streams into one
DataStream<Integer> result = source1.union(source2);
result.print().setParallelism(1);
sEnv.execute("FlinkDemo1");
}
}
(2)connect
public class FlinkDemo2 {
public static void main(String[] args) throws Exception {
// use the single-parallelism source created earlier for this test
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
//create two DataStream sources
DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());
DataStream<String> source2 = sEnv.addSource(new MySingleDataSource())
.map(new MapFunction<Integer, String>() {
public String map(Integer value) throws Exception {
// convert the Integer to a String
return "String"+value;
}
});
// the two streams may carry different data types
ConnectedStreams<Integer, String> connect = source1.connect(source2);
// handle each data type separately and return different results
connect.map(new CoMapFunction<Integer, String, Object>() {
public Object map1(Integer value) throws Exception {
// process the first stream (Integer)
return "Integer stream element: "+ value;
}
public Object map2(String value) throws Exception {
// process the second stream (String)
return "String stream element: "+ value;
}
}).print().setParallelism(1);
sEnv.execute("FlinkDemo2");
}
}
(3) split and select
public class FlinkDemo3 {
public static void main(String[] args) throws Exception {
// use the single-parallelism source created earlier for this test
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());
//split the source into odd and even numbers by tagging each element
SplitStream<Integer> split = source1.split(new OutputSelector<Integer>() {
public Iterable<String> select(Integer value) {
//a collection holding all tags for this element; one element can carry several tags
ArrayList<String> selector = new ArrayList<String>();
/*
 * String: a tag attached to the element
 * Iterable: an element may receive more than one tag
 */
if (value % 2 == 0) {
selector.add("even"); //even numbers
} else {
selector.add("odd"); //odd numbers
}
return selector;
}
});
//select all odd numbers
split.select("odd").print().setParallelism(1);
sEnv.execute("FlinkDemo3");
}
}
(4) Custom partitioning
Flink ships with its own partitioning rules, but sometimes they are not what we want, so we can define our own partitioner (a sketch of the MyPartitioner class follows this example).
public class MyPartitionerTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());
SingleOutputStreamOperator<Tuple1<Integer>> data = source1.map(new MapFunction<Integer, Tuple1<Integer>>() {
public Tuple1<Integer> map(Integer integer) throws Exception {
return new Tuple1<Integer>(integer);
}
});
DataStream<Tuple1<Integer>> partitioner = data.partitionCustom(new MyPartitioner(), 0);
partitioner.map(new MapFunction<Tuple1<Integer>, Integer>() {
public Integer map(Tuple1<Integer> value) throws Exception {
//get the value
Integer data = value.f0;
long threadID = Thread.currentThread().getId();
System.out.println("thread: "+ threadID +"\t value: " + data);
return data;
}
}).print().setParallelism(1);
sEnv.execute("MyPartitionerTest");
}
}
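The MyPartitioner class referenced above is not shown in the original notes; a minimal sketch of what it could look like (the even/odd rule is only an illustration):
public class MyPartitioner implements Partitioner<Integer> {
    //decide which downstream partition a key is sent to
    public int partition(Integer key, int numPartitions) {
        //for example: even keys go to partition 0, odd keys to partition 1
        return key % 2;
    }
}
Partitioner here is org.apache.flink.api.common.functions.Partitioner, which is the type that partitionCustom(partitioner, field) expects.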
3. Data Sink: output destinations
(1) Saving data to Redis
Read data from the Kafka source, process it, and write the result to Redis (the connector dependency sketch follows this paragraph).
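The Redis sink used below comes from the Bahir Flink connector; the notes do not list the dependency, but it would presumably look like this:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.1.5</version>
</dependency>
RedisSink, FlinkJedisPoolConfig, RedisMapper, RedisCommand, and RedisCommandDescription all live under the org.apache.flink.streaming.connectors.redis package.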
public class FlinkStreamWithKafka {
public static void main(String[] args) throws Exception {
// create a Kafka consumer as the source
Properties props = new Properties();
// broker address
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.111:9093");
// consumer group
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");
FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<String>("mydemotopic1", new SimpleStringSchema(), props);
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> source1 = sEnv.addSource(source);
SingleOutputStreamOperator<WordCount> result = source1.flatMap(new FlatMapFunction<String, WordCount>() {
public void flatMap(String s, Collector<WordCount> out) throws Exception {
String[] words = s.split(" ");
for (String word : words) {
out.collect(new WordCount(word, 1));
}
}
}).keyBy("word").sum("count");
FlinkJedisPoolConfig conf = new FlinkJedisPoolConfig.Builder()
.setHost("192.168.92.111").setPort(6379).build();
RedisSink<WordCount> redisSink = new RedisSink<WordCount>(conf , new MyRedisMapper());
//write the result to Redis
result.addSink(redisSink);
sEnv.execute("FlinkStreamWithKafka");
}
}
Create the MyRedisMapper class as the mapper that tells the sink how to write to Redis.
public class MyRedisMapper implements RedisMapper<WordCount> {
public RedisCommandDescription getCommandDescription() {
return new RedisCommandDescription(RedisCommand.HSET,"myflink");
}
public String getKeyFromData(WordCount wordCount) {
return wordCount.word;
}
public String getValueFromData(WordCount wordCount) {
return String.valueOf(wordCount.count);
}
}
IV. Advanced features
1. Distributed cache: similar to a map-side join, used to improve performance
The data is cached once on every TaskManager node rather than once per task, which reduces memory usage and improves performance.
Note: the data is cached on the node, and every task running there can use it.
Java implementation:
public class DistributedCacheDemo {
public static void main(String[] args) throws Exception {
// create the DataSet API execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//register the file to cache
//the path can be on HDFS or local
//if it is on HDFS, the HDFS dependency must be on the classpath
env.registerCachedFile("D:\\InstallDev\\Java\\MyJavaProject\\flinkDemo\\src\\main\\java\\demo3\\data.txt"
, "localfile");
//run a simple computation
//create a DataSet
DataSet<Integer> source = env.fromElements(1,2,3,4,5,6,7,8,9,10);
/*
 * A RichMapFunction is needed so that its open() method can read the cached file during initialization
 */
source.map(new RichMapFunction<Integer, String>() {
String shareData ;
/**
 * open() is executed only once,
 * so it is the right place for initialization work,
 * such as reading broadcast variables or cached files.
 */
@Override
public void open(Configuration parameters) throws Exception {
// read the distributed-cache file
File localfile = this.getRuntimeContext().getDistributedCache().getFile("localfile");
List<String> lines = FileUtils.readLines(localfile);
//take the data (the first line)
shareData = lines.get(0);
}
public String map(Integer integer) throws Exception {
return shareData + integer;
}
}).print();
// a batch job that ends with print() does not need env.execute()
// env.execute("DistributedCacheDemo");
}
}
The printed output shows that every task shares the cached data.
2. Setting parallelism
See the earlier sections.
3. Broadcast variables
Essentially the same mechanism as the distributed cache.
Difference:
- distributed cache -------> a file
- broadcast variable --------> a variable
Java implementation:
public class BroadCastDemo {
public static void main(String[] args) throws Exception {
// create the DataSet API execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//the data to broadcast: name and age
List<Tuple2<String, Integer>> people = new ArrayList<Tuple2<String,Integer>>();
people.add(new Tuple2<String, Integer>("Tom",23));
people.add(new Tuple2<String, Integer>("Mike",20));
people.add(new Tuple2<String, Integer>("Jone",25));
DataSet<Tuple2<String, Integer>> peopleData = env.fromCollection(people);
//convert peopleData into HashMaps ----> the data to broadcast
DataSet<HashMap<String,Integer>> broadCast = peopleData.map(new MapFunction<Tuple2<String,Integer>, HashMap<String,Integer>>() {
public HashMap<String, Integer> map(Tuple2<String, Integer> value) throws Exception {
HashMap<String, Integer> result = new HashMap<String, Integer>();
result.put(value.f0,value.f1);
return result;
}
});
//look up the age (value) by name (key)
DataSet<String> source = env.fromElements("Tom","Mike","Jone");
DataSet<String> result = source.map(new RichMapFunction<String, String>() {
//a field that holds the content of the broadcast variable
HashMap<String,Integer> allMap = new HashMap<String, Integer>();
public void open(Configuration parameters) throws Exception {
//fetch the broadcast variable
List<HashMap<String,Integer>> broadVariable = getRuntimeContext().getBroadcastVariable("mydata");
for(HashMap<String,Integer> x:broadVariable) {
allMap.putAll(x);
}
}
public String map(String name) throws Exception {
// look up the age for the name
Integer age = allMap.get(name);
return "name: " + name + "\t age: " + age;
}
}).withBroadcastSet(broadCast, "mydata");
result.print();
}
}
4. Accumulators and counters
Purpose: maintain a single, global copy of a value across all parallel tasks.
Note: the accumulator's value is only available after the job has finished.
Java implementation:
public class AccumulatorDemo {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// create a DataSet
DataSet<String> data = env.fromElements("Tom", "Mike", "Mary", "Jone");
MapOperator<String, Integer> result = data.map(new RichMapFunction<String, Integer>() {
//step 1: create an accumulator
private IntCounter intCounter = new IntCounter();
@Override
public void open(Configuration parameters) throws Exception {
//step 2: register the accumulator
this.getRuntimeContext().addAccumulator("myIntCounter", intCounter);
}
public Integer map(String s) throws Exception {
//step 3: count
intCounter.add(1);
return 0;
}
}).setParallelism(4);
// a batch job needs a sink, otherwise execution fails
result.writeAsText("D:\\InstallDev\\Java\\MyJavaProject\\flinkDemo\\src\\main\\java\\data");
JobExecutionResult finalResult = env.execute("AccumulatorDemo");
//step 4: fetch the accumulator's value
Object total = finalResult.getAccumulatorResult("myIntCounter");
System.out.println("final accumulator value: "+total);
}
}
5. State management
Three kinds of state persistence are supported: memory, filesystem, and RocksDB.
(1) What is state?
Stateful computation means that while a program runs, Flink stores intermediate results internally and makes them available to subsequent functions or operators. The state can live in local storage, either Flink's heap or off-heap memory, or in a third-party store such as the RocksDB backend that Flink already provides; users can also implement their own storage to support more complex computation logic.
Stateless computation, by contrast, stores nothing: each record is processed on its own, the result is emitted, and the next record is handled independently of previous ones.
(2) Checkpoints
Informally, a checkpoint is a timer: we configure how often it fires, and each time it does, the state is persisted to the location we specify.
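A minimal sketch of how checkpointing is usually enabled and tuned in code (the interval, mode, and timeout here are illustrative values, not settings taken from these notes):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// take a checkpoint of all operator state every 5 seconds
env.enableCheckpointing(5000);
// exactly-once is the default guarantee; set explicitly here only to show the knob
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// discard a checkpoint if it does not complete within 60 seconds
env.getCheckpointConfig().setCheckpointTimeout(60000);
The comprehensive example at the end of this chapter enables checkpointing the same way and points the state backend at HDFS.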
(3) Checkpoint backend storage
Three kinds of state persistence are supported: memory, filesystem, and RocksDB.
To use HDFS as the backend storage, add the dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.2</version>
<scope>provided</scope>
</dependency>
(4) Restart strategies
Flink supports several restart strategies that control how a job restarts after a failure. The common ones are:
- Fixed delay
- Failure rate
- No restart
If checkpointing is not enabled, the no-restart strategy is used.
- Option 1: global configuration in flink-conf.yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
- Option 2: set it in the application code
sEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay between attempts
));
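The failure-rate strategy is listed above but not shown; set in application code it would look roughly like this (the numbers are illustrative only):
sEnv.setRestartStrategy(RestartStrategies.failureRateRestart(
3, // maximum number of failures within the measuring interval
Time.of(5, TimeUnit.MINUTES), // measuring interval for the failure rate
Time.of(10, TimeUnit.SECONDS) // delay between two consecutive restart attempts
));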
Comprehensive state-management example
Java implementation
//the computation: sum every three numbers
public class CountWindowWithState {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
//enable checkpointing
sEnv.enableCheckpointing(1000); //take a checkpoint every second
// backend storage for checkpoints
sEnv.setStateBackend(new FsStateBackend("hdfs://bigdata111:9000/flink/ckpt"));
// restart strategy
sEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay between attempts
));
//build a keyed stream
sEnv.fromElements(
Tuple2.of(1, 1),
Tuple2.of(1, 2),
Tuple2.of(1, 3),
Tuple2.of(1, 4),
Tuple2.of(1, 5),
Tuple2.of(1, 6),
Tuple2.of(1, 7),
Tuple2.of(1, 8),
Tuple2.of(1, 9))
.keyBy(0)
.flatMap(new MyFlatMapFunction())
.print().setParallelism(1);
sEnv.execute("CountWindowWithState");
}
}
class MyFlatMapFunction extends RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>{
//define a state
//the first Integer is the element count; the second is the running sum
private ValueState<Tuple2<Integer,Integer>> state;
@Override
public void open(Configuration parameters) throws Exception {
//initialize the state
ValueStateDescriptor<Tuple2<Integer,Integer>> descriptor =
new ValueStateDescriptor<Tuple2<Integer,Integer>>("mystate", //name of the state
TypeInformation.of(new TypeHint<Tuple2<Integer,Integer>>() {
}), //type of the state
Tuple2.of(0, 0)); //default value
state = getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
// sum every three elements
//read the current state
Tuple2<Integer,Integer> current = state.value();
//increment the element count
current.f0 += 1;
//add the value to the running sum
current.f1 += value.f1;
//update the state
state.update(current);
//check whether three elements have been seen
if(current.f0 >= 3) {
//emit the result: (count, sum)
out.collect(new Tuple2<Integer, Integer>(current.f0,current.f1));
//clear the state
state.clear();
}
}
}
Operators may produce intermediate results that need to be maintained and managed; that is exactly what state is for.
6. Window computation and watermarks (out-of-order data)
Window computation was covered earlier, so it is not repeated here.
To understand watermarks, first understand the three notions of time:
- Event time: the time at which the source produced the data
- Ingestion time: the time at which the data arrives in Flink
- Processing time: the time at which Flink actually processes the data
When processing out-of-order data you must choose a time characteristic ----> usually event time.
Here is how I think about watermarks: they are a delayed-trigger mechanism, a way of waiting for data. The source emits records in order, but by the time they reach Flink the network may have reordered them. Flink still wants to process them as if they were ordered, so we configure a waiting period, an estimate of how late the slowest record may arrive. How, then, is the window computation actually triggered? The sketch below shows the rule.
The watermark behaves like the water level in a container: as you keep pouring water in, the level slowly rises. In Flink, as data flows through, the watermark rises with it; it is the measure of how far event time has progressed at any given moment.
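A small sketch of the triggering rule, assuming the bounded-out-of-orderness strategy used in the example below; the 3-second bound is the value from that example and the variable names are my own:
// watermark = highest event time seen so far, minus the allowed lateness
long boundMs = 3000L;                          // forBoundedOutOfOrderness(Duration.ofSeconds(3))
long watermark = maxEventTimeSeen - boundMs;   // rises as new records arrive
// an event-time window [start, end) fires once the watermark passes its end,
// i.e. Flink has waited an extra 3 seconds for stragglers belonging to that window
boolean fire = watermark >= windowEnd;
(Flink's implementation subtracts one extra millisecond, but the idea is the same.)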
7. allowedLateness
By default, data that arrives later than the watermark allows is simply dropped. With allowedLateness (plus a side output) we can handle such late data separately.
Core code:
//define the tag for the side-output stream
OutputTag<StationLog> lateTag = new OutputTag<StationLog>("late-Data") {};
.allowedLateness(Time.seconds(30)).sideOutputLateData(lateTag)
result.getSideOutput(lateTag).print();
Comprehensive window-computation example
Java implementation
package day1113.datastream;
import java.time.Duration;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
// For each base station, every 3 seconds, output the call record with the longest duration over the past 5 seconds.
public class WaterMarkDemo {
public static void main(String[] args) throws Exception {
//get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//set the time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
//set how often watermarks are generated; on a high-volume stream, emitting a watermark per event would hurt performance
env.getConfig().setAutoWatermarkInterval(100); //the default is 100 ms
//define the tag for the side-output stream
OutputTag<StationLog> lateTag = new OutputTag<StationLog>("late-Data") {};
//get the input stream
DataStreamSource<String> stream = env.socketTextStream("bigdata111", 1234);
SingleOutputStreamOperator<String> result = stream.flatMap(new FlatMapFunction<String, StationLog>() {
public void flatMap(String data, Collector<StationLog> output) throws Exception {
String[] words = data.split(",");
// station ID, from, to, duration, callTime
output.collect(new StationLog(words[0], words[1],words[2], Long.parseLong(words[3]), Long.parseLong(words[4])));
}
}).filter(new FilterFunction<StationLog>() {
@Override
public boolean filter(StationLog value) throws Exception {
return value.getDuration() > 0;
}
}).assignTimestampsAndWatermarks(WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(3)) //allow events to arrive up to 3 seconds late
.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
@Override
public long extractTimestamp(StationLog element, long recordTimestamp) {
return element.getCallTime(); //use the call time as the event-time field
}
})
).keyBy(new KeySelector<StationLog, String>(){
@Override
public String getKey(StationLog value) throws Exception {
return value.getStationID(); //group by station
}}
).timeWindow(Time.seconds(5),Time.seconds(3)) //sliding window: 5 seconds long, sliding every 3 seconds
// send data that is still too late to the side output
.allowedLateness(Time.seconds(30)).sideOutputLateData(lateTag)
.reduce(new MyReduceFunction(),new MyProcessWindows());
//get the side output
//data that arrives too late even for the watermark is handled separately -----> simply printed here
result.getSideOutput(lateTag).print();
result.print();
env.execute();
}
}
//how to process the data inside the window: find the record with the longest call duration
class MyReduceFunction implements ReduceFunction<StationLog> {
@Override
public StationLog reduce(StationLog value1, StationLog value2) throws Exception {
// keep the call record with the longer duration
return value1.getDuration() >= value2.getDuration() ? value1 : value2;
}
}
//what the window emits once it has been processed
class MyProcessWindows extends ProcessWindowFunction<StationLog, String, String, TimeWindow> {
@Override
public void process(String key, ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements, Collector<String> out) throws Exception {
StationLog maxLog = elements.iterator().next();
StringBuffer sb = new StringBuffer();
sb.append("窗口范围是:").append(context.window().getStart()).append("----").append(context.window().getEnd()).append("\n");;
sb.append("基站ID:").append(maxLog.getStationID()).append("\t")
.append("呼叫时间:").append(maxLog.getCallTime()).append("\t")
.append("主叫号码:").append(maxLog.getFrom()).append("\t")
.append("被叫号码:") .append(maxLog.getTo()).append("\t")
.append("通话时长:").append(maxLog.getDuration()).append("\n");
out.collect(sb.toString());
}
}
The StationLog class
//station1,18688822219,18684812319,10,1595158485855
public class StationLog {
private String stationID; //station ID
private String from; //caller
private String to; //callee
private long duration; //call duration
private long callTime; //time the call was made
public StationLog(String stationID, String from,
String to, long duration,
long callTime) {
this.stationID = stationID;
this.from = from;
this.to = to;
this.duration = duration;
this.callTime = callTime;
}
public String getStationID() {
return stationID;
}
public void setStationID(String stationID) {
this.stationID = stationID;
}
public long getCallTime() {
return callTime;
}
public void setCallTime(long callTime) {
this.callTime = callTime;
}
public String getFrom() {
return from;
}
public void setFrom(String from) {
this.from = from;
}
public String getTo() {
return to;
}
public void setTo(String to) {
this.to = to;
}
public long getDuration() {
return duration;
}
public void setDuration(long duration) {
this.duration = duration;
}
}
Sample data (station, caller, callee, duration, callTime):
station1,18688822219,18684812319,10,1595158485855
station5,13488822219,13488822219,50,1595158490856
station5,13488822219,13488822219,50,1595158495856
station5,13488822219,13488822219,50,1595158500856
station5,13488822219,13488822219,50,1595158505856
station2,18464812121,18684812319,20,1595158507856
station3,18468481231,18464812121,30,1595158510856
station5,13488822219,13488822219,50,1595158515857
station2,18464812121,18684812319,20,1595158517857
station4,18684812319,18468481231,40,1595158521857
station0,18684812319,18688822219,0,1595158521857
station2,18464812121,18684812319,20,1595158523858
station6,18608881319,18608881319,60,1595158529858
station3,18468481231,18464812121,30,1595158532859
station4,18684812319,18468481231,40,1595158536859
station2,18464812121,18684812319,20,1595158538859
station1,18688822219,18684812319,10,1595158539859
station5,13488822219,13488822219,50,1595158544859
station4,18684812319,18468481231,40,1595158548859
station3,18468481231,18464812121,30,1595158551859
station1,18688822219,18684812319,10,1595158552859
station3,18468481231,18464812121,30,1595158555859
station0,18684812319,18688822219,0,1595158555859
station2,18464812121,18684812319,20,1595158557859
station4,18684812319,18468481231,40,1595158561859
V. The data analysis engine: Flink Table & SQL (not a focus; just a few examples)
Flink Table & SQL was still immature and under active development at the time of writing (2020-11-20):
Please note that the Table API and SQL are not yet feature complete and are being actively developed. Not all operations are supported by every combination of [Table API, SQL] and [stream, batch] input.
Add the dependencies:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge_2.11</artifactId>
<version>1.11.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_2.11</artifactId>
<version>1.11.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner-blink_2.11</artifactId>
<version>1.11.0</version>
<scope>provided</scope>
</dependency>
Java demo
(1) Batch processing
The WordCountBatchTableAPI class
public class WordCountBatchTableAPI {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//create a DataSet that represents the data to process
DataSet<String> source = env.fromElements("I love Beijing","I love China",
"Beijing is the capital of China");
DataSet<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {
@Override
public void flatMap(String s, Collector<WordCount> collector) throws Exception {
String[] words = s.split(" ");
for (String word : words) {
collector.collect(new WordCount(word, 1));
}
}
});
// create the batch table environment
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
// convert the DataSet into a Table
Table table = tEnv.fromDataSet(input);
// process the data with the Table API
Table data = table.groupBy("word").select("word,frequency.sum as frequency");
// convert the result back into a DataSet
DataSet<WordCount> result = tEnv.toDataSet(data, WordCount.class);
result.print();
}
}
class WordCount {
public String word;
public long frequency;
public WordCount() {
}
public WordCount(String word, int frequency) {
this.word = word;
this.frequency = frequency;
}
@Override
public String toString() {
return "WordCount [word=" + word + ", frequency=" + frequency + "]";
}
}
The WordCountBatchSQL class
public class WordCountBatchSQL {
public static void main(String[] args) throws Exception {
// create the DataSet API execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//create a DataSet that represents the data to process
DataSet<String> source = env.fromElements("I love Beijing","I love China",
"Beijing is the capital of China");
//produce WordCount objects
DataSet<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {
@Override
public void flatMap(String value, Collector<WordCount> out) throws Exception {
// I love Beijing
String[] words = value.split(" ");
for(String w:words) {
// k2 v2
out.collect(new WordCount(w,1));
}
}
});
//create a Table execution environment
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
//register the DataSet as a table
tEnv.registerDataSet("WordCount", input,"word,frequency");
//run the SQL query
Table table = tEnv.sqlQuery("select word,sum(frequency) as frequency from WordCount group by word");
DataSet<WordCount> result = tEnv.toDataSet(table, WordCount.class);
result.print();
}
//a class describing the table schema
public static class WordCount{
public String word;
public long frequency;
public WordCount() {}
public WordCount(String word,int frequency) {
this.word = word;
this.frequency = frequency;
}
@Override
public String toString() {
return "WordCount [word=" + word + ", frequency=" + frequency + "]";
}
}
}
(2) Stream processing
The WordCountStreamTableAPI class
public class WordCountStreamTableAPI {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
//receive input from the socket
DataStreamSource<String> source = sEnv.socketTextStream("bigdata111", 1234);
DataStream<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {
@Override
public void flatMap(String value, Collector<WordCount> out) throws Exception {
// example input: I love Beijing
String[] words = value.split(" ");
for(String word:words) {
out.collect(new WordCount(word,1));
}
}
});
//create a table execution environment
StreamTableEnvironment stEnv = StreamTableEnvironment.create(sEnv);
Table table = stEnv.fromDataStream(input,"word,frequency");
Table result = table.groupBy("word").select("word,frequency.sum").as("word", "frequency");
//output the result
stEnv.toRetractStream(result, WordCount.class).print();
sEnv.execute("WordCountStreamTableAPI");
}
//a class describing the table schema
public static class WordCount{
public String word;
public long frequency;
public WordCount() {}
public WordCount(String word,int frequency) {
this.word = word;
this.frequency = frequency;
}
@Override
public String toString() {
return "WordCount [word=" + word + ", frequency=" + frequency + "]";
}
}
}
The WordCountStreamSQL class
public class WordCountStreamSQL {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
//receive input from the socket
DataStreamSource<String> source = sEnv.socketTextStream("bigdata111", 1234);
DataStream<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {
@Override
public void flatMap(String value, Collector<WordCount> out) throws Exception {
// example input: I love Beijing
String[] words = value.split(" ");
for(String word:words) {
out.collect(new WordCount(word,1));
}
}
});
//create a table execution environment
StreamTableEnvironment stEnv = StreamTableEnvironment.create(sEnv);
Table table = stEnv.fromDataStream(input,"word,frequency");
//run the SQL query
Table result = stEnv.sqlQuery("select word,sum(frequency) as frequency from " + table + " group by word");
stEnv.toRetractStream(result, WordCount.class).print();
sEnv.execute("WordCountStreamSQL");
}
//a class describing the table schema
public static class WordCount{
public String word;
public long frequency;
public WordCount() {}
public WordCount(String word,int frequency) {
this.word = word;
this.frequency = frequency;
}
@Override
public String toString() {
return "WordCount [word=" + word + ", frequency=" + frequency + "]";
}
}
}