Table of Contents
- Flink series articles
- I. Maven dependencies
- II. Sink examples
- 1. JDBC example
- 1) Maven dependencies
- 2) Implementation
- 3) Verification
- 2. Kafka example
- 1) Maven dependencies
- 2) Implementation
- 3) Verification
- 3. Redis example
- 1) API introduction
- 2) Maven dependencies
- 3) Implementation
- 4) Verification
- 4. MySQL example
- 1) Maven dependencies
- 2) Implementation
- 3) Verification
- 5. DistributedCache example
- 1) Maven dependencies
- 2) Implementation
- 3) Verification
- 6. Broadcast variable example
- 1) Implementation
- 2) Verification
This article walks through six Flink sink approaches: JDBC, MySQL, Kafka, Redis, the distributed cache, and broadcast variables, with implementation code and verification steps for each.
It assumes working Kafka and Redis environments.
The article has two parts: the Maven dependencies and the sink implementation examples.
I. Maven dependencies
All of the examples below use the following Maven dependencies unless stated otherwise.
<properties>
    <encoding>UTF-8</encoding>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>
    <scala.version>2.12</scala.version>
    <flink.version>1.12.0</flink.version>
</properties>
<dependencies>
    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.8</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-scala-bridge_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <!-- logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.7</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.1.4</version>
    </dependency>
</dependencies>
II. Sink examples
This part covers Flink sinks, with examples writing to an RDBMS (via JDBC), Kafka, Redis, and MySQL, among others.
1. JDBC example
This example inserts data into MySQL through Flink's JDBC connector.
1) Maven dependencies
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
    <!--<version>8.0.20</version> -->
</dependency>
2) Implementation
- User bean
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * @author alanchan
 *
 */
@Data
@AllArgsConstructor
@NoArgsConstructor
public class User {
    private int id;
    private String name;
    private String pwd;
    private String email;
    private int age;
    private double balance;
}
- Implementation
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.source_transformation_sink.bean.User;

/**
 * @author alanchan
 *
 */
public class JDBCDemo {

    public static void main(String[] args) throws Exception {
        // env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        // source
        DataStream<User> userDS = env.fromElements(new User(1, "alanchanchn", "vx", "alan.chan.chn@163.com", 19, 800));

        // transformation

        // sink
        // signature of JdbcSink.sink for reference:
        // public static <T> SinkFunction<T> sink(String sql, JdbcStatementBuilder<T> statementBuilder, JdbcConnectionOptions connectionOptions) {
        //     return sink(sql, statementBuilder, JdbcExecutionOptions.defaults(), connectionOptions);
        // }
        String sql = "INSERT INTO `user` (`id`, `name`, `pwd`, `email`, `age`, `balance`) VALUES (null, ?, ?, ?, ?, ?)";
        String driverName = "com.mysql.jdbc.Driver";
        String url = "jdbc:mysql://192.168.10.44:3306/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false";
        String name = "root";
        String pw = "root";

        // Option 1: anonymous inner class
        // userDS.addSink(JdbcSink.sink(sql, new JdbcStatementBuilder<User>() {
        //     @Override
        //     public void accept(PreparedStatement ps, User value) throws SQLException {
        //         ps.setString(1, value.getName());
        //         ps.setString(2, value.getPwd());
        //         ps.setString(3, value.getEmail());
        //         ps.setInt(4, value.getAge());
        //         ps.setDouble(5, value.getBalance());
        //     }
        // // (String url, String driverName, String username, String password)
        // }, new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        //         .withDriverName(driverName)
        //         .withUrl(url)
        //         .withUsername(name)
        //         .withPassword(pw)
        //         .build()));

        // Option 2: lambda
        userDS.addSink(JdbcSink.sink(sql, (ps, value) -> {
            ps.setString(1, value.getName());
            ps.setString(2, value.getPwd());
            ps.setString(3, value.getEmail());
            ps.setInt(4, value.getAge());
            ps.setDouble(5, value.getBalance());
        }, new JdbcConnectionOptions.JdbcConnectionOptionsBuilder().withDriverName(driverName).withUrl(url).withUsername(name).withPassword(pw).build()));

        // execute
        env.execute();
    }
}
3) Verification
Run the application and check MySQL: a new row should appear in the user table.
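The job assumes a user table already exists in the test database. A hypothetical definition matching the INSERT statement above, plus a quick check (the column types are assumptions; adjust to your actual schema):

-- hypothetical schema matching the INSERT used by JDBCDemo
CREATE TABLE `user` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(64),
  `pwd` VARCHAR(64),
  `email` VARCHAR(128),
  `age` INT,
  `balance` DOUBLE,
  PRIMARY KEY (`id`)
);

-- after running the job:
SELECT * FROM `user`;
-- expect a row like: alanchanchn, vx, alan.chan.chn@163.com, 19, 800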
2. Kafka example
1) Maven dependencies
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-sql-connector-kafka_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>
2) Implementation
import java.util.Properties;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

/**
 * @author alanchan
 *
 */
public class SinkKafka {

    public static void main(String[] args) throws Exception {
        // env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        // source
        // Kafka connection properties
        Properties props = new Properties();
        // broker address
        props.setProperty("bootstrap.servers", "server1:9092");
        // consumer group id
        props.setProperty("group.id", "flink");
        // latest: resume from the committed offset if one exists, otherwise start from the latest message
        // earliest: resume from the committed offset if one exists, otherwise start from the earliest message
        props.setProperty("auto.offset.reset", "latest");
        // check Kafka's partitions every 5s in a background thread (dynamic partition discovery)
        props.setProperty("flink.partition-discovery.interval-millis", "5000");
        // auto-commit offsets
        props.setProperty("enable.auto.commit", "true");
        // auto-commit interval
        props.setProperty("auto.commit.interval.ms", "2000");

        // create the FlinkKafkaConsumer source with the properties above
        FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<String>("t_kafkasource", new SimpleStringSchema(), props);
        DataStream<String> kafkaDS = env.addSource(kafkaSource);

        // transformation
        // keep only messages containing "alan"
        SingleOutputStreamOperator<String> etlDS = kafkaDS.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String value) throws Exception {
                return value.contains("alan");
            }
        });

        // sink
        etlDS.print();

        Properties props2 = new Properties();
        props2.setProperty("bootstrap.servers", "server1:9092");
        FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>("t_kafkasink", new SimpleStringSchema(), props2);
        etlDS.addSink(kafkaSink);

        // execute
        env.execute();
    }
}
3) Verification
- The example's logic:
1. Read from Kafka topic t_kafkasource with FlinkKafkaConsumer.
2. Filter the stream into SingleOutputStreamOperator<String> etlDS, keeping messages that contain "alan".
3. Sink etlDS to Kafka topic t_kafkasink.
- Verification steps:
1. Start Kafka and open a console producer on topic t_kafkasource
kafka-console-producer.sh --broker-list server1:9092 --topic t_kafkasource
[alanchan@server1 onekeystart]$ kafka-console-producer.sh --broker-list server1:9092 --topic t_kafkasource
2. Start the application
3. Open a console consumer on topic t_kafkasink to watch the output
kafka-console-consumer.sh --bootstrap-server server1:9092 --topic t_kafkasink --from-beginning
[alanchan@server1 testdata]$ kafka-console-consumer.sh --bootstrap-server server1:9092 --topic t_kafkasink --from-beginning
4. Type messages into the producer console
[alanchan@server1 onekeystart]$ kafka-console-producer.sh --broker-list server1:9092 --topic t_kafkasource
>i am alanchanchn alan
>i like flink alan
>and you?
>
5. Watch the application console output
6. Watch the consumer console output
[alanchan@server1 testdata]$ kafka-console-consumer.sh --bootstrap-server server1:9092 --topic t_kafkasink --from-beginning
i am alanchan [e][end]
i am alanchanchn alan
i like flink alan
The produced messages containing "alan" were sinked to Kafka as expected, while "and you?" was filtered out (the consumer's first line appears to be leftover data from an earlier test, shown because of --from-beginning).
3. Redis example
1) API introduction
Flink provides a dedicated RedisSink for writing to Redis; it is convenient to use and spares us from worrying about performance details ourselves. This section introduces how to use RedisSink.
The core of RedisSink is the RedisMapper interface: you write your own Redis mapper class implementing its three methods, as sketched below.
- 1. getCommandDescription(): choose the Redis data structure to write to (via RedisCommand) and, for hash/sorted-set commands, the name of the outer key.
- 2. String getKeyFromData(T data): extract the key of the key/value pair from a record.
- 3. String getValueFromData(T data): extract the value of the key/value pair from a record.
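As a minimal, hedged sketch (not this article's original implementation): assuming the Apache Bahir flink-connector-redis dependency, a mapper that writes (word, count) tuples into a Redis hash could look like the following. The host server1, the hash name wordcount, and the stream wordCountDS are illustrative assumptions.

<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

// maps (word, count) tuples into a Redis hash: HSET wordcount <word> <count>
public class WordCountRedisMapper implements RedisMapper<Tuple2<String, Integer>> {
    @Override
    public RedisCommandDescription getCommandDescription() {
        // data structure type (hash) and the outer key name
        return new RedisCommandDescription(RedisCommand.HSET, "wordcount");
    }

    @Override
    public String getKeyFromData(Tuple2<String, Integer> data) {
        // field name inside the hash
        return data.f0;
    }

    @Override
    public String getValueFromData(Tuple2<String, Integer> data) {
        // field value inside the hash
        return data.f1.toString();
    }
}

Attaching the sink to a hypothetical wordCountDS stream:

FlinkJedisPoolConfig conf = new FlinkJedisPoolConfig.Builder().setHost("server1").setPort(6379).build();
wordCountDS.addSink(new RedisSink<>(conf, new WordCountRedisMapper()));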
5. DistributedCache example
Flink's distributed cache can be used to share external static data, such as a dimension-table file, with every node running the job.
1) Maven dependencies
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.1.4</version>
</dependency>
2) Implementation
import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.FileUtils;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.MapOperator;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;

/**
 * @author alanchan
 *
 */
public class DistributedCacheSink {

    public static void main(String[] args) throws Exception {
        // env
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // source
        // register the distributed cache file
        env.registerCachedFile("hdfs://server2:8020//flinktest/words/goodsDistributedCacheFile", "goodsDistributedCacheFile");

        // order data set: (id, name, goodsid)
        DataSource<Tuple3<Integer, String, Integer>> ordersDS = env
                .fromCollection(Arrays.asList(Tuple3.of(1, "alanchanchn", 1), Tuple3.of(2, "alanchan", 4), Tuple3.of(3, "alan", 123)));

        // transformation
        // join ordersDS (id, name, goodsid) with the cached file data (goodsid, goodsname)
        // to produce records of the form (id, name, goodsid, goodsname)
        MapOperator<Tuple3<Integer, String, Integer>, Tuple4<Integer, String, Integer, String>> result = ordersDS
                // signature of RichMapFunction for reference:
                // public abstract class RichMapFunction<IN, OUT> extends AbstractRichFunction implements MapFunction<IN, OUT> {
                //     public abstract OUT map(IN value) throws Exception;
                // }
                .map(new RichMapFunction<Tuple3<Integer, String, Integer>, Tuple4<Integer, String, Integer, String>>() {
                    // local lookup table built from the cached file; adapt to your actual use case
                    Map<Integer, String> goodsMap = new HashMap<>();

                    // read the cached file once per task and load it into the local map
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        File file = getRuntimeContext().getDistributedCache().getFile("goodsDistributedCacheFile");
                        List<String> goodsList = FileUtils.readLines(file, "UTF-8");
                        for (String str : goodsList) {
                            String[] arr = str.split(",");
                            goodsMap.put(Integer.parseInt(arr[0]), arr[1]);
                        }
                    }

                    // join each order with the cached goods data
                    @Override
                    public Tuple4<Integer, String, Integer, String> map(Tuple3<Integer, String, Integer> value) throws Exception {
                        // returns (id, name, goodsid, goodsname)
                        return Tuple4.of(value.f0, value.f1, value.f2, goodsMap.get(value.f2));
                    }
                });

        // sink
        result.print();
    }
}
3) Verification
- Verification steps:
1. Prepare the distributed cache file and upload it to HDFS; a sample file and the upload commands are sketched below.
2. Run the program and check the output.
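A hypothetical goodsDistributedCacheFile in the (goodsid,goodsname) CSV format that the job parses, together with the HDFS upload commands (the file contents are illustrative assumptions):

1,spark
2,flink
3,hadoop
4,java
123,kafka

hadoop fs -mkdir -p /flinktest/words
hadoop fs -put goodsDistributedCacheFile hdfs://server2:8020/flinktest/words/

With a file like this in place, the program prints tuples such as (1,alanchanchn,1,spark).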
6. Broadcast variable example
Flink supports broadcast variables. A data set can be broadcast to every TaskManager and held in memory there, where all of that TaskManager's subtasks can read it. This avoids a large amount of shuffling: in a join, for example, one of the data sets can be broadcast and kept in TaskManager memory, so it is read locally instead of being transferred repeatedly across the cluster, which would hurt performance.
This example implements the same idea as the distributed cache example above, but with a broadcast variable; it is fairly simple, and the logic is essentially the same.
1) Implementation
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.MapOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;

/**
 * @author alanchan
 *
 */
public class BroadcastSink {

    public static void main(String[] args) throws Exception {
        // env
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // source
        // student data set: (id, name)
        DataSource<Tuple2<Integer, String>> studentDS = env.fromCollection(Arrays.asList(Tuple2.of(1, "alan"), Tuple2.of(2, "alanchan"), Tuple2.of(3, "alanchanchn")));

        // score data set: (student id, subject, score)
        DataSource<Tuple3<Integer, String, Integer>> scoreDS = env
                .fromCollection(Arrays.asList(Tuple3.of(1, "chinese", 50), Tuple3.of(1, "math", 90), Tuple3.of(1, "english", 90), Tuple3.of(2, "math", 70), Tuple3.of(3, "art", 86)));

        // transformation
        // broadcast studentDS (id, name) to the TaskManagers' memory,
        // then join scoreDS (student id, subject, score) with it to get (id, name, subject, score)
        MapOperator<Tuple3<Integer, String, Integer>, Tuple4<Integer, String, String, Integer>> result = scoreDS
                .map(new RichMapFunction<Tuple3<Integer, String, Integer>, Tuple4<Integer, String, String, Integer>>() {
                    // local lookup table built from the broadcast data
                    Map<Integer, String> studentsMap = new HashMap<>();

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        // fetch the broadcast data set
                        List<Tuple2<Integer, String>> studentList = getRuntimeContext().getBroadcastVariable("studentsInfo");
                        for (Tuple2<Integer, String> tuple : studentList) {
                            studentsMap.put(tuple.f0, tuple.f1);
                        }
                    }

                    @Override
                    public Tuple4<Integer, String, String, Integer> map(Tuple3<Integer, String, Integer> value) throws Exception {
                        // look up the student name in the broadcast data
                        Integer stuID = value.f0;
                        String stuName = studentsMap.getOrDefault(stuID, "");
                        return Tuple4.of(stuID, stuName, value.f1, value.f2);
                    }
                }).withBroadcastSet(studentDS, "studentsInfo");

        // sink
        result.print();
    }
}
2) Verification
Run the application: each score record is printed joined with the student's name, e.g. (1,alan,chinese,50), (2,alanchan,math,70), (3,alanchanchn,art,86) (output order may vary).
This concludes the detailed walk-through of the six Flink sink approaches: JDBC, MySQL, Kafka, Redis, the distributed cache, and broadcast variables.