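1. Overview
Multi-stream transformations: in real applications, you may need to connect and merge data from different sources, or to split a single stream apart. Scenarios that involve several streams are therefore common, and they fall into two broad categories: "splitting" and "merging".
"Splitting": usually implemented with side outputs.
"Merging": depending on the requirement, streams can be combined with union, connect, join, coGroup, and similar operations.
One stream can be split into several, and several streams can be merged into one. This article covers the basic merge operation, union. The simplest way to merge is to put several streams together into one, which is called a "union" of streams, as shown in the figure below:
Usage:
Call union() directly on a DataStream and pass the other DataStreams as arguments. The result is again a DataStream.
For example: stream1.union(stream2, stream3, …)
Notes:
- All streams in a union must have the same element type; the merged stream contains every element of every input stream.
- union() accepts multiple DataStreams as arguments.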
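To make the two notes above concrete, here is a minimal sketch, assuming the Flink 1.14 dependencies listed in section 2.1 are on the classpath. The class name MultiUnionSketch and the sample values are made up for illustration. It unions three streams in one call and shows one possible way to bring a stream of another type (Integer) to the common type (String) with map() before the union.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical demo class; not part of the original project. */
public class MultiUnionSketch {

    /** Unions three streams and returns the collected elements, sorted for a stable result. */
    public static List<String> runUnion() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStream<String> letters = env.fromElements("a", "b");
        DataStream<String> words = env.fromElements("x", "y");
        DataStream<Integer> numbers = env.fromElements(1, 2);

        // numbers is a DataStream<Integer>, so letters.union(numbers) would not compile.
        // Map it to String first so that all inputs share one element type.
        DataStream<String> numbersAsText = numbers
                .map(n -> "n" + n)
                .returns(Types.STRING);

        // One union() call can take several streams of the same type.
        DataStream<String> merged = letters.union(words, numbersAsText);

        // The arrival order across inputs is not guaranteed, so sort before returning.
        List<String> result = new ArrayList<>(merged.executeAndCollect(10));
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runUnion());
    }
}
```

The result is sorted only to make it reproducible; union() itself makes no promise about how elements from different inputs interleave.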
2. Examples
2.1 Dependencies
pom.xml of the parent project (flink-demo):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.antiy</groupId>
<artifactId>flink-demo</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>pom</packaging>
<name>flink-demo</name>
<url>http://maven.apache.org</url>
<modules>
<module>muti-stream-union</module>
</modules>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<flink.version>1.14.4</flink.version>
<scala.binary.version>2.12</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>2.0.10</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.24</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
pom.xml of the child project (muti-stream-union):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>flink-demo</artifactId>
<groupId>cn.antiy</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<groupId>cn.antiy</groupId>
<artifactId>muti-stream-union</artifactId>
<packaging>jar</packaging>
<name>muti-stream-union</name>
<url>http://maven.apache.org</url>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
2.2 Example 1: union of an odd-number stream and an even-number stream
2.2.1 Simple union
Flow chart:
Code:
package cn.antiy.union.base;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Test: union of an "odd-number stream" and an "even-number stream".
 */
public class OddUnionEvenDataStreamTest {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Odd-number stream
        DataStreamSource<Integer> oddNumStream = env.fromElements(1, 3, 5, 7);
        // Even-number stream
        DataStreamSource<Integer> evenStream = env.fromElements(2, 4, 6, 8);
        // Union the two streams and print the result to the console
        oddNumStream.union(evenStream).print();
        // Start the job
        env.execute("OddUnionEvenDataStreamTest");
    }
}
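Screenshot of the run:
2.3 Example 2: union of a login event stream and a download event stream
2.3.1 Simple union
Flow chart:
Code of the Event entity class: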
package cn.antiy.union.event.v1.entity;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * Event entity class.
 * Contains all attributes of both the "login log" and the "download log".
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class Event {
    /**
     * Timestamp
     */
    private Long timestamp;
    /**
     * User id
     */
    private String userId;
    /**
     * IP address
     */
    private String ipAddress;
    /**
     * Event type
     */
    private String eventType;
    /**
     * Spare field, used to pad the size of a single record
     */
    private String remark;
    /**
     * Download status: 0 = download failed; 1 = download succeeded
     */
    private String downloadStatus;
    /**
     * File name
     */
    private String filename;
}
Code of the union test class:
package cn.antiy.union.event.v1;

import cn.antiy.union.event.v1.entity.Event;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Test: union of the "login stream" and the "download stream".
 */
public class LoginUnionDownloadDataStreamTest {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Login log stream
        DataStream<Event> loginEventStream = env.fromElements(
                new Event(1659657901L, "A", "192.168.1.1", "0", "2022-08-05 08:05:01", null, null),
                new Event(1659657903L, "A", "192.168.1.1", "0", "2022-08-05 08:05:03", null, null),
                new Event(1659657905L, "A", "192.168.1.1", "0", "2022-08-05 08:05:05", null, null),
                new Event(1659657907L, "A", "192.168.1.1", "0", "2022-08-05 08:05:07", null, null)
        );
        // Download log stream (downloadStatus: "0" = failed, "1" = succeeded)
        DataStream<Event> downloadEventStream = env.fromElements(
                new Event(1659657902L, "A", null, null, "2022-08-05 08:05:02", "0", "西游记"),
                new Event(1659657904L, "A", null, null, "2022-08-05 08:05:04", "1", "三国志"),
                new Event(1659657906L, "A", null, null, "2022-08-05 08:05:06", "1", "红楼梦"),
                new Event(1659657908L, "A", null, null, "2022-08-05 08:05:08", "1", "水浒传"),
                new Event(1659657966L, "A", null, null, "2022-08-05 08:06:06", "1", "山海经")
        );
        // Union the two streams and print the result to the console
        loginEventStream.union(downloadEventStream).print();
        // Start the job
        env.execute("LoginUnionDownloadDataStreamTest");
    }
}
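Console output:
2.3.2 Union with event-time ordering
As the example in 2.3.1 shows, a plain union only merges the two streams, and the merged stream has no guaranteed order. To emit the merged data in order, define a custom sorting rule, as in the following example:
Flow chart:
Code of the Event entity class: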
package cn.antiy.union.event.v2.entity;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * Event entity class.
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class Event implements Comparable<Event> {
    /**
     * Timestamp
     */
    private Long timestamp;
    /**
     * User id
     */
    private String userId;
    /**
     * IP address
     */
    private String ipAddress;
    /**
     * Event type
     */
    private String eventType;
    /**
     * Spare field, used to pad the size of a single record
     */
    private String remark;
    /**
     * Download status: 0 = download failed; 1 = download succeeded
     */
    private String downloadStatus;
    /**
     * File name
     */
    private String filename;

    /**
     * Key used by DataStream keyBy() for grouping and sorting.
     * It is a constant, so all events share one key and are sorted together.
     */
    public String getKey() {
        return "1";
    }

    /**
     * Orders events by ascending timestamp.
     */
    @Override
    public int compareTo(Event o) {
        return Long.compare(this.timestamp, o.timestamp);
    }
}
Code of the custom sorting process function:
package cn.antiy.union.event.v2;

import cn.antiy.union.event.v2.entity.Event;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimerService;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.PriorityQueue;

/**
 * Function that sorts events by timestamp.
 */
public class SortByTimestampFunction extends KeyedProcessFunction<String, Event, Event> {
    // Per-key buffer of events that are waiting for the watermark to pass them
    private transient ValueState<PriorityQueue<Event>> queueState;

    @Override
    public void open(Configuration config) {
        ValueStateDescriptor<PriorityQueue<Event>> descriptor = new ValueStateDescriptor<>(
                // state name
                "sorted-events",
                // type information of state
                TypeInformation.of(new TypeHint<PriorityQueue<Event>>() {
                }));
        queueState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Event event, Context context, Collector<Event> out) throws Exception {
        TimerService timerService = context.timerService();
        long currentWatermark = timerService.currentWatermark();
        // System.out.format("processElement called with watermark %d\n", currentWatermark);
        // Buffer only events that are ahead of the watermark; late events are dropped
        if (context.timestamp() > currentWatermark) {
            PriorityQueue<Event> queue = queueState.value();
            if (queue == null) {
                queue = new PriorityQueue<>(10);
            }
            queue.add(event);
            queueState.update(queue);
            // Fire a timer once the watermark reaches this event's timestamp
            timerService.registerEventTimeTimer(event.getTimestamp());
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext context, Collector<Event> out) throws Exception {
        PriorityQueue<Event> queue = queueState.value();
        long watermark = context.timerService().currentWatermark();
        // System.out.format("onTimer called with watermark %d\n", watermark);
        // Emit, in timestamp order, every buffered event the watermark has passed
        Event head = queue.peek();
        while (head != null && head.getTimestamp() <= watermark) {
            out.collect(head);
            queue.poll();
            head = queue.peek();
        }
        // Write the modified queue back so that the removal is persisted in state
        queueState.update(queue);
    }
}
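The buffering idea behind SortByTimestampFunction can be tried out without a Flink cluster. The following is a simplified, Flink-free simulation, not the function itself: the class WatermarkBufferSketch and its methods are hypothetical, plain timestamps stand in for whole events, and the watermark is advanced by hand instead of by timers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical, Flink-free simulation of the buffer-then-release idea. */
public class WatermarkBufferSketch {
    // Min-heap: the smallest buffered timestamp is always at the head
    private final PriorityQueue<Long> buffer = new PriorityQueue<>();
    private long watermark = Long.MIN_VALUE;

    /** Buffers a timestamp unless it is late, i.e. at or behind the watermark. */
    public boolean add(long timestamp) {
        if (timestamp <= watermark) {
            return false; // late element: dropped, as in processElement
        }
        buffer.add(timestamp);
        return true;
    }

    /** Advances the watermark and releases, in order, everything at or behind it. */
    public List<Long> advanceTo(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        List<Long> released = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek() <= watermark) {
            released.add(buffer.poll());
        }
        return released;
    }

    public static void main(String[] args) {
        WatermarkBufferSketch sketch = new WatermarkBufferSketch();
        // Out-of-order arrival, as after a union of two streams
        for (long ts : new long[] {1659657903L, 1659657901L, 1659657904L, 1659657902L}) {
            sketch.add(ts);
        }
        System.out.println(sketch.advanceTo(1659657902L));    // [1659657901, 1659657902]
        System.out.println(sketch.add(1659657901L));          // false (late)
        System.out.println(sketch.advanceTo(Long.MAX_VALUE)); // [1659657903, 1659657904]
    }
}
```

The last call mirrors what happens when a bounded source finishes: the watermark jumps to its maximum value and everything still buffered is flushed in timestamp order.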
Code of the union-and-sort test class:
package cn.antiy.union.event.v2;

import cn.antiy.union.event.v2.entity.Event;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

/**
 * Test case 2 for "union and sort of the login and download streams":
 * uses no deprecated methods (this approach is recommended).
 */
public class UnionAndSortDataStreamTest {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Login log stream
        DataStream<Event> loginEventStream = env.fromElements(
                new Event(1659657901L, "2022-08-05 08:05:01", "192.168.1.1", "0", null, null, null),
                new Event(1659657903L, "2022-08-05 08:05:03", "192.168.1.1", "0", null, null, null),
                new Event(1659657905L, "2022-08-05 08:05:05", "192.168.1.1", "0", null, null, null),
                new Event(1659657907L, "2022-08-05 08:05:07", "192.168.1.1", "0", null, null, null)
        ).assignTimestampsAndWatermarks(
                // Note: Flink interprets event-time timestamps as epoch milliseconds;
                // the sample values are passed through unchanged
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
                        .withTimestampAssigner(
                                (Event event, long l) -> {
                                    return event.getTimestamp();
                                }
                        )
        );
        // Download log stream (downloadStatus: "0" = failed, "1" = succeeded)
        DataStream<Event> downloadEventStream = env.fromElements(
                new Event(1659657902L, "2022-08-05 08:05:02", null, null, null, "0", "西游记"),
                new Event(1659657904L, "2022-08-05 08:05:04", null, null, null, "1", "三国志"),
                new Event(1659657906L, "2022-08-05 08:05:06", null, null, null, "1", "红楼梦"),
                new Event(1659657908L, "2022-08-05 08:05:08", null, null, null, "1", "水浒传"),
                new Event(1659657966L, "2022-08-05 08:06:06", null, null, null, "1", "山海经")
        ).assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
                        .withTimestampAssigner(
                                (Event event, long l) -> {
                                    return event.getTimestamp();
                                }
                        )
        );
        // Union the streams, sort by timestamp, and print the result
        loginEventStream.union(downloadEventStream)
                .keyBy(r -> r.getKey())
                .process(new SortByTimestampFunction())
                .print();
        // Start the job
        env.execute("UnionAndSortDataStreamTest");
    }
}
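Console output:
As the figure above shows:
- After the out-of-order streams are unioned, given watermarks, and passed through the custom sorting rule, the merged stream is emitted in ascending timestamp order, which is the expected result.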
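One detail worth checking in this example is the unit of the timestamps. Flink interprets event-time timestamps and watermarks as epoch milliseconds, so Duration.ofSeconds(1L) is a bound of 1000 ms. The sample values such as 1659657901L look like epoch seconds, since they match the date strings in the test data when read at UTC+8 (this reading is an assumption). The small stdlib-only sketch below, with the hypothetical class name TimestampUnitSketch, shows the check and the conversion:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper; not part of the original project. */
public class TimestampUnitSketch {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Converts an epoch-second value to epoch milliseconds. */
    public static long toEpochMillis(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).toEpochMilli();
    }

    /** Formats an epoch-second value at UTC+8 (the zone is an assumption). */
    public static String formatUtcPlus8(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds)
                .atOffset(ZoneOffset.ofHours(8))
                .format(FORMAT);
    }

    public static void main(String[] args) {
        long first = 1659657901L; // first login event in the test data
        System.out.println(formatUtcPlus8(first));              // 2022-08-05 08:05:01
        System.out.println(toEpochMillis(first));               // 1659657901000
        System.out.println(Duration.ofSeconds(1L).toMillis());  // 1000
    }
}
```

If the values are indeed seconds, the timestamp assigner could return the converted value, and the timers and comparisons in SortByTimestampFunction would then need to use the same unit.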
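Summary
This article took a first look at the union operation among Flink's multi-stream transformations and showed how to union streams and then sort the merged result using watermarks.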