A TopN statement retrieves the top N records from a streaming table.

This post summarizes how to use real-time TopN in Flink Table & SQL.

Syntax

SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum
   FROM table_name)
WHERE rownum <= N [AND conditions]

Notes

  • ROW_NUMBER(): assigns each row a unique sequential number (starting from 1) based on the row order within its partition.
  • PARTITION BY col1[, col2...]: specifies the partition columns; this clause is optional. Each partition produces its own TopN result.
  • ORDER BY col1 [asc|desc][, col2 [asc|desc]...]: specifies the sort columns; multiple sort columns are allowed.
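
For instance, a minimal sketch assuming a hypothetical table shop_sales with columns product_id, category, and sales; it returns, for each category, the three products with the highest sales:

-- Per category, keep the three products with the highest sales
SELECT category, product_id, sales
FROM (
   SELECT category, product_id, sales,
     ROW_NUMBER() OVER (PARTITION BY category
       ORDER BY sales DESC) AS rownum
   FROM shop_sales)
WHERE rownum <= 3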

Note:

  1. A TopN query requires two layers of queries:
  • The inner query uses the ROW_NUMBER() OVER window function to assign each row in a partition (set via PARTITION BY) a rank (rownum) according to the sort columns (set via ORDER BY).
  • The outer query filters on that rank to keep only the top N rows.
  2. The WHERE clause of the outer query must contain a condition of the form rownum <= N; otherwise Flink cannot recognize the query as a TopN query.
  3. The outer WHERE clause may carry additional conditions ([AND conditions]), but they must be joined with AND.
  4. During computation, when the TopN of a partition changes, the changed records are sent downstream as retract/add records. A storage system that supports updates is therefore recommended as the sink for TopN results. In addition, if the TopN records are written to external storage, that storage should have a matching primary key or unique key.
  5. The unique key of a TopN query is the combination of the partition columns and the rownum column.
  6. No-ranking output optimization
  • The no-ranking output optimization avoids the data expansion problem.
  • The data expansion problem

Under the TopN syntax above, the rownum field is written to the result table as one of its unique-key fields, which can cause data expansion. For example, if an update arrives for a record previously ranked 9 and its new rank is 1, the ranks of every record from 1 through 9 change, and all of them must be written to the result table as updates. The result table then receives so many writes that its update throughput drops.

  • The no-ranking output approach

The result table does not store rownum; the final ranking is determined by the consumer when it reads the data. Since a TopN result set is usually small, the consumer can sort it quickly. When a record previously ranked 9 moves up to rank 1, only that one record needs to be emitted, rather than re-emitting every record ranked 1 through 9. This optimization significantly improves the update throughput of the result table.

  • No-ranking output syntax
SELECT col1, col2, col3
FROM (
 SELECT col1, col2, col3,
   ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
   ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum
 FROM table_name)
WHERE rownum <= N [AND conditions]

The syntax is the same as above, except that the rownum field is dropped from the outer query's output. Note that with no-ranking output, the unique key should be the GROUP BY key list of the TopN query's upstream. Also, data consumers must sort the records themselves at read time to recover the true TopN order, as shown in the sketch after this list.

  7. TopN is supported only by the Blink planner.
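
For example, a minimal sketch of the consumer-side re-sort, assuming a hypothetical result table product_topn(category, product_id, sales) written by a no-ranking-output TopN job with unique key (category, product_id):

-- No rownum column is stored; the consumer recovers the real ranking at read time
SELECT category, product_id, sales
FROM product_topn
WHERE category = 'book'   -- pick one partition
ORDER BY sales DESC       -- the same ordering the TopN query used
LIMIT 3                   -- N = 3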

Example: real-time Top 3 users by click count, per product, per hour

Test data

//Hand-crafted test data, as follows:
//A user clicks a product at a given moment
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:00:00","productID":"product_1"}

Result table schema

CREATE TABLE `t_product_click_topn` (
  `datetime` varchar(15) NOT NULL,
  `productID` varchar(15) NOT NULL,
  `userID` varchar(15) NOT NULL,
  `clickPV` bigint(20) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
  PRIMARY KEY (`datetime`,`productID`,`userID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
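
Note that this table stores no rownum column, and its primary key (datetime, productID, userID) is exactly the GROUP BY key list of the upstream aggregation, i.e., the unique key required by the no-ranking output optimization described above.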

Code example

package com.bigdata.flink.topN;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.java.StreamTableEnvironment;

/**
 * Author: Wang Pei
 * Summary:
 * Streaming TopN
 */
public class StreamingTopN {
    public static void main(String[] args) throws Exception {

        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(streamEnv, settings);

        // Source DDL
        String sourceDDL = ""
                + "create table source_kafka "
                + "( "
                + "    userID String, "
                + "    eventType String, "
                + "    eventTime String, "
                + "    productID String "
                + ") with ( "
                + "    'connector.type' = 'kafka', "
                + "    'connector.version' = '0.10', "
                + "    'connector.properties.bootstrap.servers' = 'kafka01:9092', "
                + "    'connector.properties.zookeeper.connect' = 'kafka01:2181', "
                + "    'connector.topic' = 'test_1', "
                + "    'connector.properties.group.id' = 'c1_test_1', "
                + "    'connector.startup-mode' = 'latest-offset', "
                + "    'format.type' = 'json' "
                + ")";
        tableEnv.sqlUpdate(sourceDDL);
        //tableEnv.toAppendStream(tableEnv.from("source_kafka"), Row.class).print();

        // Sink DDL
        String sinkDDL = ""
                + "create table sink_mysql "
                + "( "
                + "     datetime STRING, "
                + "     productID STRING, "
                + "     userID STRING, "
                + "     clickPV BIGINT "
                + ") with ( "
                + "    'connector.type' = 'jdbc', "
                + "    'connector.url' = 'jdbc:mysql://localhost:3306/bigdata', "
                + "    'connector.table' = 't_product_click_topn', "
                + "    'connector.username' = 'root', "
                + "    'connector.password' = 'bigdata', "
                + "    'connector.write.flush.max-rows' = '50', "
                + "    'connector.write.flush.interval' = '2s', "
                + "    'connector.write.max-retries' = '3' "
                + ")";
        tableEnv.sqlUpdate(sinkDDL);

        // Query
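        // Inner query: aggregate clicks into an hourly click PV per (hour, product, user);
        // SUBSTRING(eventTime,1,13) truncates "yyyy-MM-dd HH:mm:ss" to "yyyy-MM-dd HH".
        // Outer query: rank users by clickPV within each (hour, product) partition and keep
        // rownum <= 3; rownum itself is not written to the sink (no-ranking output optimization).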
        String execSQL = ""
                + "INSERT INTO sink_mysql "
                + "SELECT datetime, productID, userID, clickPV "
                + "FROM ( "
                + "  SELECT *, "
                + "     ROW_NUMBER() OVER (PARTITION BY datetime, productID ORDER BY clickPV desc) AS rownum "
                + "  FROM ( "
                + "        SELECT SUBSTRING(eventTime,1,13) AS datetime, "
                + "            productID, "
                + "            userID, "
                + "            count(1) AS clickPV "
                + "        FROM source_kafka "
                + "        GROUP BY SUBSTRING(eventTime,1,13), productID, userID "
                + "    ) a "
                + ") t "
                + "WHERE rownum <= 3";
        tableEnv.sqlUpdate(execSQL);

        tableEnv.execute(StreamingTopN.class.getSimpleName());
    }
}

Checking the results

Feed test data into Kafka, as follows:

{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:01:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:02:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:03:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:04:00","productID":"product_1"}
{"userID":"user_1","eventType":"click","eventTime":"2015-01-01 01:05:00","productID":"product_1"}
{"userID":"user_2","eventType":"click","eventTime":"2015-01-01 01:01:00","productID":"product_1"}
{"userID":"user_2","eventType":"click","eventTime":"2015-01-01 01:02:00","productID":"product_1"}
{"userID":"user_2","eventType":"click","eventTime":"2015-01-01 01:03:00","productID":"product_1"}
{"userID":"user_3","eventType":"click","eventTime":"2015-01-01 01:01:00","productID":"product_1"}
{"userID":"user_3","eventType":"click","eventTime":"2015-01-01 01:04:00","productID":"product_1"}
{"userID":"user_4","eventType":"click","eventTime":"2015-01-01 01:05:00","productID":"product_1"}

Check the results in MySQL

mysql> select * from t_product_click_topn;
+---------------+-----------+--------+---------+---------------------+---------------------+
| datetime      | productID | userID | clickPV | created_at          | updated_at          |
+---------------+-----------+--------+---------+---------------------+---------------------+
| 2015-01-01 01 | product_1 | user_1 |       5 | 2020-04-05 22:50:55 | 2020-04-05 22:50:55 |
| 2015-01-01 01 | product_1 | user_2 |       3 | 2020-04-05 22:50:55 | 2020-04-05 22:50:55 |
| 2015-01-01 01 | product_1 | user_3 |       2 | 2020-04-05 22:50:55 | 2020-04-05 22:50:55 |
+---------------+-----------+--------+---------+---------------------+---------------------+
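
Since the sink stores no rownum, a consumer recovers the true Top 3 ordering at read time by sorting, e.g. with a query like the following against MySQL:

-- Re-rank at read time: true Top 3 for one hour and one product
SELECT userID, clickPV
FROM t_product_click_topn
WHERE `datetime` = '2015-01-01 01' AND productID = 'product_1'
ORDER BY clickPV DESC
LIMIT 3;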