flink Table json flink table jsonpath

Reposted

mob64ca141834d3 2024-08-17 13:25:17


Contents

1. What Is the Table API

1.1 Flink API Overview
1.2 Features of the Table API

2. Table API Programming

2.1 WordCount Example
2.2 Table API Operations

How to get a table?
How to emit a table?
How to query a table?
Categories of the Table API
Columns Operation & Function
Row-based Operation

3. Table API Developments

1. What Is the Table API

1.1 Flink API Overview

[Figure: overview of the Flink APIs]

1.2 Features of the Table API

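Taking WordCount as an example, the Table API compares with SQL as follows:

High performance: a groupBy aggregation is computed only once; if later select calls use it several times, they reuse the earlier aggregation result instead of recomputing it.

Unified stream and batch: the Table API offers a single API for both stream and batch processing, which simplifies development.

One way to see this: the Table API makes it easier to write data processing that involves multiple statements.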

// One filter operation, with the different results inserted into different tables
table.filter("a < 10").insertInto("table1");
table.filter("a > 10").insertInto("table2");

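In cases like this, the Table API is much simpler to use than SQL.

Overall, the Table API can be viewed as a superset of SQL. Because it is Flink's own API, it improves on SQL to some degree in usability, functionality, and extensibility.

2. Table API Programming

2.1 WordCount Example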

https://github.com/hequn8128/TableApiDemo

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.OldCsv;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;

public class JavaBatchWordCount {

	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

		String path = JavaBatchWordCount.class.getClassLoader().getResource("words.txt").getPath();
		// Read the file
		tEnv.connect(new FileSystem().path(path))
			// Specify the format (CSV with a line delimiter)
			.withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
			// Specify the schema
			.withSchema(new Schema().field("word", Types.STRING))
			// Register this source file in the environment
			.registerTableSource("fileSource");
		// Scan the source to get a Table, then program against it with the Table API
		Table result = tEnv.scan("fileSource")
			.groupBy("word")
			.select("word, count(1) as count");

		tEnv.toDataSet(result, Row.class).print();
	}
}

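Note: when you use a TableEnvironment, import the environment class from the package that matches your use case.

There are currently the following eight TableEnvironment variants:

[Figure: the eight TableEnvironment variants]

2.2 Table API Operations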

How to get a table?

You can think of a table as something that is registered in the environment and then scanned back out of it.

tEnv.
	...
	...
	.registerTableSource("Mytable");
Table myTable = tEnv.scan("Mytable");

There are three ways to register a table:

Table descriptor
Specify a file system, a format, and a schema.

tEnv
	.connect(new FileSystem().path(path))
	.withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
	.withSchema(new Schema().field("word", Types.STRING))
	.registerTableSource("sourceTable");

User defined table source
Write a custom table source against the TableSource interface, then register it with the environment.

TableSource csvSource = new CsvTableSource(
	path,
	new String[]{"word"},
	new TypeInformation[]{Types.STRING});
tEnv.registerTableSource("sourceTable2", csvSource);

Register a DataStream

DataStream<String> stream = ...
// register the DataStream as table "myTable3" with
// fields "word"
tableEnv.registerDataStream("myTable3", stream, "word");

With these three ways of registering a table, you can register a table in the environment, scan it back out, and then program against it with the Table API.

How to emit a table?

resultTable is a result of type Table; insertInto writes it to a target table.

resultTable.insertInto("TargetTable");

Likewise, there are three ways to emit a table:

Table descriptor

tEnv
	.connect(new FileSystem().path(path))
	.withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
	.withSchema(new Schema().field("word", Types.STRING))
	.registerTableSink("targetTable");

User defined table sink

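// CsvTableSink takes the path and a field delimiter; the field names and
// types are supplied when the sink is registered
TableSink csvSink = new CsvTableSink(path, ",");
tEnv.registerTableSink(
	"sinkTable2",
	new String[]{"word"},
	new TypeInformation[]{Types.STRING},
	csvSink);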

Emit to a DataStream

// emit the result table to a DataStream
DataStream<Tuple2<Boolean, Row>> stream =
	tableEnv.toRetractStream(resultTable, Row.class);
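
In the retract stream above, the Boolean in each Tuple2 marks the message as an add (true) or a retract (false). The following is a minimal, dependency-free Java model of that encoding for a running word count. RetractSketch, Change, and countWords are hypothetical names chosen for illustration; they are not Flink classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, dependency-free model of a retract stream; not a Flink class.
final class RetractSketch {

    // One change message: add == true inserts the row, add == false retracts it.
    static final class Change {
        final boolean add;
        final String word;
        final long count;

        Change(boolean add, String word, long count) {
            this.add = add;
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "(" + add + ", " + word + ", " + count + ")";
        }
    }

    // Produces the change messages of a running word count over the input.
    static List<Change> countWords(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        List<Change> out = new ArrayList<>();
        for (String word : words) {
            Long previous = counts.get(word);
            if (previous != null) {
                // Retract the stale result before emitting the updated one.
                out.add(new Change(false, word, previous));
            }
            long updated = previous == null ? 1L : previous + 1L;
            counts.put(word, updated);
            out.add(new Change(true, word, updated));
        }
        return out;
    }
}
```

For the input a, b, a this model yields (true, a, 1), (true, b, 1), (false, a, 1), (true, a, 2): the second a first retracts the old count and then adds the new one.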

How to query a table?

[Figure: overview of Table API query operations]

Categories of the Table API

[Figure: categories of Table API operations]

Columns Operation & Function

Columns Operation (usability)

// AddColumns: add a new column
Table orders = tableEnv.scan("Orders");
Table result = orders.addColumns("concat(c, 'sunny') as desc");
// AddOrReplaceColumns: add a column, replacing any existing column with the same name
Table orders = tableEnv.scan("Orders");
Table result = orders.addOrReplaceColumns("concat(c, 'sunny') as desc");
// DropColumns: drop columns
Table orders = tableEnv.scan("Orders");
Table result = orders.dropColumns("b, c");
// RenameColumns: rename columns
Table orders = tableEnv.scan("Orders");
Table result = orders.renameColumns("b as b2, c as c2");

Columns Function (usability)

// Select the specified columns: columns 2 through 4
select("withColumns(2 to 4)")
// Deselect the specified columns: every column except columns 2 through 4
select("withoutColumns(2 to 4)")

Arguments of column functions

Column functions accept column references, index positions, column names, and so on.
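
To make the index-range form concrete, here is a small dependency-free Java model of what withColumns(2 to 4) and withoutColumns(2 to 4) select from a list of column names. It assumes positions start at 1 and that the range is inclusive on both ends. ColumnRangeSketch and its methods are hypothetical helpers, not part of Flink.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of index-range column selection; not a Flink class.
final class ColumnRangeSketch {

    // Keeps the columns at 1-based positions from..to (inclusive).
    static List<String> withColumns(List<String> columns, int from, int to) {
        return new ArrayList<>(columns.subList(from - 1, to));
    }

    // Keeps every column except those at 1-based positions from..to (inclusive).
    static List<String> withoutColumns(List<String> columns, int from, int to) {
        List<String> kept = new ArrayList<>(columns.subList(0, from - 1));
        kept.addAll(columns.subList(to, columns.size()));
        return kept;
    }
}
```

With columns a, b, c, d, e, the range 2 to 4 selects b, c, d, and deselecting the same range leaves a, e.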

Columns Operation & Function summary

[Figure: summary of column operations and column functions]

Row-based Operation

map Operation (usability)

map requires you to define a ScalarFunction, which carries out the map operation.

When a table has many columns and a single select would have to apply a UDF to every one of them, map lets you handle them all in one operation.

flatMap Operation (usability)

flatMap takes one row as input and emits multiple rows; it requires you to define a TableFunction.
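
The one-row-in, many-rows-out contract can be sketched without any Flink dependency. The model below mirrors the general shape of a table function, where an eval method may call collect any number of times per input row. TableFn, Split, and flatMap are hypothetical stand-ins written for this illustration, not Flink's own classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, dependency-free model of a flatMap with a table function.
final class FlatMapSketch {

    // Stand-in for a table function: eval may call collect zero or more times.
    abstract static class TableFn<T> {
        private final List<T> collected = new ArrayList<>();

        protected void collect(T row) {
            collected.add(row);
        }

        abstract void eval(String input);

        List<T> results() {
            return collected;
        }
    }

    // Splits a line on single spaces and emits one output row per word.
    static final class Split extends TableFn<String> {
        @Override
        void eval(String line) {
            for (String word : line.split(" ")) {
                collect(word);
            }
        }
    }

    // Applies the function to every input row and returns all emitted rows.
    static List<String> flatMap(List<String> rows, TableFn<String> fn) {
        for (String row : rows) {
            fn.eval(row);
        }
        return fn.results();
    }
}
```

Passing the two rows "hello flink" and "table api" through Split produces four output rows, one per word.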

aggregate Operation (usability)

aggregate takes multiple rows as input and emits one row; it accepts an AggregateFunction. Taking Count as an example: first define a CountAccumulator, then write the aggregation logic, and finally return the result from getValue.
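
The three steps described for Count (create an accumulator, accumulate each row, read the result with getValue) can be modeled in plain Java as follows. This is a sketch of the pattern only: CountAggSketch, Count, and the aggregate driver are hypothetical and do not extend any Flink base class.

```java
import java.util.List;

// Hypothetical, dependency-free model of a Count aggregate function.
final class CountAggSketch {

    // Intermediate state kept while rows are being aggregated.
    static final class CountAccumulator {
        long count;
    }

    static final class Count {
        // Step 1: create the empty accumulator.
        CountAccumulator createAccumulator() {
            return new CountAccumulator();
        }

        // Step 2: fold one input row into the accumulator.
        void accumulate(CountAccumulator acc, Object value) {
            acc.count += 1;
        }

        // Step 3: produce the single output value.
        long getValue(CountAccumulator acc) {
            return acc.count;
        }
    }

    // Drives the three steps over all rows: many rows in, one value out.
    static long aggregate(List<?> rows) {
        Count fn = new Count();
        CountAccumulator acc = fn.createAccumulator();
        for (Object row : rows) {
            fn.accumulate(acc, row);
        }
        return fn.getValue(acc);
    }
}
```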

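flatAggregate Operation (functionality, a new extension)

flatAggregate takes multiple rows as input and emits multiple rows, for example a top-N operation. It takes a TableAggregateFunction: first define a TopNAcc accumulator, then implement accumulate. Because emitValue receives a Collector, it can emit results multiple times.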

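Aggregate vs. TableAggregate

[Figure: comparison of Aggregate (max) and TableAggregate (top 2)]

In step 2, which accumulates the intermediate result, max keeps a single maximum value while the top-2 logic keeps two values. In the final step, getValue emits only once, whereas emitValue can emit twice, which completes the top-2 logic.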
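
The difference between the two can be sketched in plain Java: max returns exactly one value, while the top-2 logic keeps two values in its accumulator and emits each of them through a collector. Here java.util.function.Consumer stands in for the Collector, and Top2Sketch and Top2Acc are hypothetical names, so treat this as a model of the pattern rather than Flink code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, dependency-free model contrasting max with top 2.
final class Top2Sketch {

    // Accumulator for top 2: keeps the two largest values seen so far.
    static final class Top2Acc {
        int first = Integer.MIN_VALUE;
        int second = Integer.MIN_VALUE;
    }

    // Folds one value into the accumulator.
    static void accumulate(Top2Acc acc, int value) {
        if (value > acc.first) {
            acc.second = acc.first;
            acc.first = value;
        } else if (value > acc.second) {
            acc.second = value;
        }
    }

    // Emits up to two results; the Consumer plays the role of the collector.
    static void emitValue(Top2Acc acc, Consumer<Integer> out) {
        if (acc.first != Integer.MIN_VALUE) {
            out.accept(acc.first);
        }
        if (acc.second != Integer.MIN_VALUE) {
            out.accept(acc.second);
        }
    }

    // Aggregate style: a single maximum, returned once.
    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    // Table-aggregate style: many rows in, up to two rows out.
    static List<Integer> top2(List<Integer> values) {
        Top2Acc acc = new Top2Acc();
        for (int v : values) {
            accumulate(acc, v);
        }
        List<Integer> out = new ArrayList<>();
        emitValue(acc, out::add);
        return out;
    }
}
```

For the values 3, 9, 5, 7, max returns 9, while top2 emits 9 and then 7.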

Row-based Operation summary

[Figure: summary of row-based operations]

3. Table API Developments

3.1 FLIP-29
https://issues.apache.org/jira/browse/FLINK-11199
3.2 Python Table API
https://issues.apache.org/jira/browse/FLINK-10972
3.3 Interactive Programming
https://issues.apache.org/jira/browse/FLINK-12308
3.4 Iterative Processing
https://issues.apache.org/jira/browse/FLINK-11199
