The TopN statement is used to retrieve the top N records of a streaming table.
This article summarizes how to use real-time TopN in Flink Table & SQL.
Syntax
SELECT [column_list]
FROM (
SELECT [column_list],
ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum
FROM table_name)
WHERE rownum <= N [AND conditions]
Explanation
- ROW_NUMBER(): assigns each row a unique, sequential number (starting from 1) according to the order of the rows within its partition.
- PARTITION BY col1[, col2...]: specifies the partition columns; it may be omitted. Each partition produces its own TopN result.
- ORDER BY col1 [asc|desc][, col2 [asc|desc]...]: specifies the ordering columns; more than one ordering column is allowed.
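For instance, a minimal sketch of the syntax; the table shop_sales and its columns are hypothetical and serve only to illustrate keeping the three best-selling products per category:
SELECT category, product_id, sales, rownum
FROM (
  SELECT category, product_id, sales,
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rownum
  FROM shop_sales
) t
WHERE rownum <= 3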
Notes:
- TopN requires two levels of query:
  - The inner query uses the ROW_NUMBER() OVER window function to assign a rank (rownum) to the rows within each partition (specified by PARTITION BY), according to the ordering columns (specified by ORDER BY).
  - The outer query filters on the rank to keep the top N rows.
- In the WHERE clause of the outer query, a condition of the form rownum <= N is required; only then can Flink recognize the query as a TopN query.
- The WHERE clause of the outer query may contain additional conditions ([AND conditions]), but multiple conditions must be combined with AND.
- During computation, whenever the TopN of a partition changes, the changed records are emitted downstream as retract/add records. It is therefore recommended to use a storage system that supports updates as the sink for TopN results. Moreover, if the TopN records are written to external storage, the external table should have a matching primary key or unique key.
- The unique key of a TopN query is the combination of the partition columns and the rownum column.
- No-ranking-output optimization
  - Using the no-ranking-output optimization avoids the data-inflation problem.
  - The data-inflation problem
    According to the TopN syntax, the rownum field is written into the result table as one of its unique-key fields, which may lead to data inflation. For example, if an update arrives for a record originally ranked 9 and its new rank is 1, the ranks of all records from 1 to 9 change, and all of them have to be written to the result table as updates. This inflates the output and slows down the result table, because it receives too many records.
  - The no-ranking-output optimization method
    Do not store rownum in the result table; the final rownum is determined by the consumer when it reads the data. Since the amount of TopN data is usually small, sorting on the consumer side is fast. When an update arrives for a record that was ranked 9 and is now ranked 1, only that single record needs to be emitted; the records ranked 1 through 9 do not all have to be sent downstream as updates. This optimization noticeably improves the update speed of the result table.
  - Syntax of the no-ranking-output optimization
SELECT col1, col2, col3
FROM (
SELECT col1, col2, col3,
ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum
FROM table_name)
WHERE rownum <= N [AND conditions]
The syntax is the same as above, except that the rownum field is removed from the output of the outer query. Note that with the no-ranking-output optimization, the unique key should be the GROUP BY key list of the TopN's upstream query. Also, consumers must sort the records themselves when reading, to obtain the real TopN result (see the sketch after this list).
- Only the Blink Planner is supported.
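To make the effect on the result key concrete, here is a variant of the hypothetical shop_sales sketch above, this time with an upstream aggregation; both queries are illustrative only:
-- Standard TopN: rownum is part of the output, so the unique key of the result
-- is the partition columns plus rownum, i.e. (category, rownum).
SELECT category, product_id, sales_cnt, rownum
FROM (
  SELECT category, product_id, sales_cnt,
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales_cnt DESC) AS rownum
  FROM (
    SELECT category, product_id, COUNT(1) AS sales_cnt
    FROM shop_sales
    GROUP BY category, product_id
  ) a
) t
WHERE rownum <= 3

-- No-ranking-output TopN: rownum is dropped from the output, so the unique key
-- of the result falls back to the upstream GROUP BY keys (category, product_id).
SELECT category, product_id, sales_cnt
FROM (
  SELECT category, product_id, sales_cnt,
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales_cnt DESC) AS rownum
  FROM (
    SELECT category, product_id, COUNT(1) AS sales_cnt
    FROM shop_sales
    GROUP BY category, product_id
  ) a
) t
WHERE rownum <= 3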
Real-time Top 3 by click count for each product in each hour
Test data
// Hand-crafted test data, as follows:
// A user clicks a product at a certain point in time
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:00:00","productID":"product_1"}
Result table schema
CREATE TABLE `t_product_click_topn` (
`datetime` varchar(15) NOT NULL,
`productID` varchar(15) NOT NULL,
`userID` varchar(15) NOT NULL,
`clickPV` bigint(20) NOT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
PRIMARY KEY (`datetime`,`productID`,`userID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Code example
package com.bigdata.flink.topN;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.java.StreamTableEnvironment;
/**
* Author: Wang Pei
* Summary:
* Streaming TopN
*/
public class StreamingTopN {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(streamEnv, settings);
// Source DDL
String sourceDDL = ""
+ "create table source_kafka "
+ "( "
+ " userID String, "
+ " eventType String, "
+ " eventTime String, "
+ " productID String "
+ ") with ( "
+ " 'connector.type' = 'kafka', "
+ " 'connector.version' = '0.10', "
+ " 'connector.properties.bootstrap.servers' = 'kafka01:9092', "
+ " 'connector.properties.zookeeper.connect' = 'kafka01:2181', "
+ " 'connector.topic' = 'test_1', "
+ " 'connector.properties.group.id' = 'c1_test_1', "
+ " 'connector.startup-mode' = 'latest-offset', "
+ " 'format.type' = 'json' "
+ ")";
tableEnv.sqlUpdate(sourceDDL);
//tableEnv.toAppendStream(tableEnv.from("source_kafka"), Row.class).print();
// Sink DDL
String sinkDDL = ""
+ "create table sink_mysql "
+ "( "
+ " datetime STRING, "
+ " productID STRING, "
+ " userID STRING, "
+ " clickPV BIGINT "
+ ") with ( "
+ " 'connector.type' = 'jdbc', "
+ " 'connector.url' = 'jdbc:mysql://localhost:3306/bigdata', "
+ " 'connector.table' = 't_product_click_topn', "
+ " 'connector.username' = 'root', "
+ " 'connector.password' = 'bigdata', "
+ " 'connector.write.flush.max-rows' = '50', "
+ " 'connector.write.flush.interval' = '2s', "
+ " 'connector.write.max-retries' = '3' "
+ ")";
tableEnv.sqlUpdate(sinkDDL);
// Query
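// Inner aggregation: hourly click count per (datetime, productID, userID).
// Outer layer: ROW_NUMBER() partitioned by (datetime, productID), ordered by clickPV desc, keeping rownum <= 3.
// rownum is not selected in the outermost projection (no-ranking-output form),
// so the sink's primary key is the upstream GROUP BY key (datetime, productID, userID).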
String execSQL = ""
+ "INSERT INTO sink_mysql "
+ "SELECT datetime, productID, userID, clickPV "
+ "FROM ( "
+ " SELECT *, "
+ " ROW_NUMBER() OVER (PARTITION BY datetime, productID ORDER BY clickPV desc) AS rownum "
+ " FROM ( "
+ " SELECT SUBSTRING(eventTime,1,13) AS datetime, "
+ " productID, "
+ " userID, "
+ " count(1) AS clickPV "
+ " FROM source_kafka "
+ " GROUP BY SUBSTRING(eventTime,1,13), productID, userID "
+ " ) a "
+ ") t "
+ "WHERE rownum <= 3";
tableEnv.sqlUpdate(execSQL);
tableEnv.execute(StreamingTopN.class.getSimpleName());
}
}
Checking the results
Feed the following test data into Kafka:
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:01:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:02:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:03:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:04:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:05:00","productID":"product_1"}
{"userID":"user_2","eventType":"click","eventTime":"2015-01-01 01:01:00","productID":"product_1"}
{"userID":"user_2","eventType":"click","eventTime":"2015-01-01 01:02:00","productID":"product_1"}
{"userID":"user_2","eventType":"click","eventTime":"2015-01-01 01:03:00","productID":"product_1"}
{"userID":"user_3","eventType":"click","eventTime":"2015-01-01 01:01:00","productID":"product_1"}
{"userID":"user_3","eventType":"click","eventTime":"2015-01-01 01:04:00","productID":"product_1"}
{"userID":"user_4","eventType":"click","eventTime":"2015-01-01 01:05:00","productID":"product_1"}
Check the results in MySQL:
mysql> select * from t_product_click_topn;
+---------------+-----------+--------+---------+---------------------+---------------------+
| datetime | productID | userID | clickPV | created_at | updated_at |
+---------------+-----------+--------+---------+---------------------+---------------------+
| 2015-01-01 01 | product_1 | user_1 | 5 | 2020-04-05 22:50:55 | 2020-04-05 22:50:55 |
| 2015-01-01 01 | product_1 | user_2 | 3 | 2020-04-05 22:50:55 | 2020-04-05 22:50:55 |
| 2015-01-01 01 | product_1 | user_3 | 2 | 2020-04-05 22:50:55 | 2020-04-05 22:50:55 |
+---------------+-----------+--------+---------+---------------------+---------------------+
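As noted above, because rownum is not stored in the result table (the no-ranking-output form), the consumer sorts at read time. For example, the real Top 3 for product_1 in hour 2015-01-01 01 could be read back from MySQL with a query along these lines:
SELECT userID, clickPV
FROM t_product_click_topn
WHERE `datetime` = '2015-01-01 01' AND productID = 'product_1'
ORDER BY clickPV DESC
LIMIT 3;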