1- Introduction

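When you need to deliver (broadcast) a low-throughput stream of events such as configuration or rules to every downstream task, you can use Broadcast State. Broadcast State was introduced in Flink 1.5.
Each downstream task receives the configuration or rules, stores them as BroadcastState, and applies them while processing the elements of another data stream.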
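
To make the delivery semantics concrete, here is a minimal plain-Java model with no Flink dependency. The class `BroadcastModel` and its methods are hypothetical names chosen for this sketch, not Flink APIs. It only illustrates the idea: a broadcast element is copied to every parallel task, while a regular event reaches a single task and is processed against that task's local copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of broadcast delivery; not Flink code. */
final class BroadcastModel {
    // One local state map per simulated parallel task.
    private final List<Map<String, String>> taskStates = new ArrayList<>();

    BroadcastModel(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            taskStates.add(new HashMap<>());
        }
    }

    /** A broadcast element is delivered to every task, so each local copy is updated. */
    void broadcast(String key, String config) {
        for (Map<String, String> state : taskStates) {
            state.put(key, config);
        }
    }

    /** A regular event goes to exactly one task and reads only that task's copy. */
    String process(int taskIndex, String key) {
        String config = taskStates.get(taskIndex).get(key);
        return config == null ? key + " -> <no config yet>" : key + " -> " + config;
    }

    public static void main(String[] args) {
        BroadcastModel model = new BroadcastModel(3);
        System.out.println(model.process(0, "user_1")); // nothing broadcast yet
        model.broadcast("user_1", "vip");
        for (int task = 0; task < 3; task++) {
            System.out.println("task " + task + ": " + model.process(task, "user_1"));
        }
    }
}
```

Whichever task an event lands on, it sees the same configuration once the broadcast element has reached that task.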

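Example scenarios:

  • 1) Dynamically updating computation rules: if the event stream must be evaluated against the latest rules, the **rules (a small amount of data)** can be broadcast to the downstream tasks as broadcast state.
  • 2) Adding extra fields in real time: if the event stream must be enriched with basic user information, that user information can be broadcast to the downstream tasks as broadcast state.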
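
Scenario 1 can be sketched in plain Java as follows. This is only an illustration of the idea under simple assumptions: `RuleEngine`, `updateRule`, and `matches` are hypothetical names, and a "rule" is reduced to a minimum amount per event type. Rule updates arrive rarely and overwrite the previous version; each event is evaluated against whichever rule is current when it arrives.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of scenario 1 (dynamic rules); not Flink code. */
final class RuleEngine {
    // eventType -> minimum amount that counts as a match (the "broadcast" side).
    private final Map<String, Integer> thresholds = new HashMap<>();

    /** Called for each rule update; a newer rule overwrites the older one. */
    void updateRule(String eventType, int minAmount) {
        thresholds.put(eventType, minAmount);
    }

    /** Called for each event; uses whatever rule is current at this moment. */
    boolean matches(String eventType, int amount) {
        Integer min = thresholds.get(eventType);
        return min != null && amount >= min;
    }

    public static void main(String[] args) {
        RuleEngine engine = new RuleEngine();
        System.out.println(engine.matches("click", 5));  // false: no rule yet
        engine.updateRule("click", 3);
        System.out.println(engine.matches("click", 5));  // true: 5 >= 3
        engine.updateRule("click", 10);
        System.out.println(engine.matches("click", 5));  // false: 5 < 10
    }
}
```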

API overview:

# When the data stream is a keyed stream
public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT> extends BaseBroadcastProcessFunction {
    public abstract void processElement(final IN1 value, final ReadOnlyContext ctx, final Collector<OUT> out) throws Exception;
    public abstract void processBroadcastElement(final IN2 value, final Context ctx, final Collector<OUT> out) throws Exception;
}

# When the data stream is a non-keyed stream
public abstract class BroadcastProcessFunction<IN1, IN2, OUT> extends BaseBroadcastProcessFunction {
    public abstract void processElement(final IN1 value, final ReadOnlyContext ctx, final Collector<OUT> out) throws Exception;
    public abstract void processBroadcastElement(final IN2 value, final Context ctx, final Collector<OUT> out) throws Exception;
}

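The type parameters above have the following meanings:

  • KS: the key type of the keyed stream, that is, the type of the key used when keyBy is called on the stream built from the upstream source operator;
  • IN1: the type of the records in the non-broadcast data stream;
  • IN2: the type of the records in the broadcast stream;
  • OUT: the type of the records emitted by the processElement() and processBroadcastElement() methods of the KeyedBroadcastProcessFunction.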
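
Note that the two callbacks receive different context types: processElement() gets a ReadOnlyContext, while processBroadcastElement() gets a Context. The plain-Java sketch below models that split. `TwoSidedState` and its methods are hypothetical names, and `Collections.unmodifiableMap` is only a stand-in: in Flink the restriction is expressed through the `ReadOnlyBroadcastState` type, whereas this sketch approximates it with a runtime check.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the read-only / read-write split; not Flink's implementation. */
final class TwoSidedState {
    private final Map<String, Integer> state = new HashMap<>();

    /** View handed to the broadcast side: may modify the state. */
    Map<String, Integer> writableView() {
        return state;
    }

    /** View handed to the non-broadcast side: reads succeed, writes throw. */
    Map<String, Integer> readOnlyView() {
        return Collections.unmodifiableMap(state);
    }

    public static void main(String[] args) {
        TwoSidedState s = new TwoSidedState();
        s.writableView().put("user_1", 10);                  // broadcast side updates
        System.out.println(s.readOnlyView().get("user_1"));  // data side reads the value
        try {
            s.readOnlyView().put("user_1", 99);              // data side may not write
        } catch (UnsupportedOperationException e) {
            System.out.println("write rejected on the read-only side");
        }
    }
}
```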

2- Example case


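Goal: filter the event stream in real time down to the users present in the configuration, and enrich each of their events with the user's basic information.

Event stream: each record says that a user browsed or clicked a product at some moment. The data is produced in real time and in large volume, in the following format.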
{"userID": "user_3", "eventTime": "2019-08-17 12:19:47", "eventType": "browse", "productID": 1}
{"userID": "user_2", "eventTime": "2019-08-17 12:19:48", "eventType": "click", "productID": 1}

Configuration data: the users' detailed information, stored in MySQL as follows.
DROP TABLE IF EXISTS `user_info`;
CREATE TABLE `user_info`  (
  `userID` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `userName` varchar(10) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `userAge` int(11) NULL DEFAULT NULL,
  PRIMARY KEY (`userID`) USING BTREE
) ENGINE = MyISAM CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
-- ----------------------------
-- Records of user_info
-- ----------------------------
INSERT INTO `user_info` VALUES ('user_1', '张三', 10);
INSERT INTO `user_info` VALUES ('user_2', '李四', 20);
INSERT INTO `user_info` VALUES ('user_3', '王五', 30);
INSERT INTO `user_info` VALUES ('user_4', '赵六', 40);
SET FOREIGN_KEY_CHECKS = 1;

Output:
(user_3,2019-08-17 12:19:47,browse,1,王五,30)
(user_2,2019-08-17 12:19:48,click,1,李四,20)
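
The join behind this output is a lookup by userID. As a minimal sketch, it can be written as a pure function that is easy to unit test outside Flink. `Enricher.enrich` is a hypothetical helper, the user info is simplified to a `String[]` of {userName, userAge}, and the user names are romanized to keep the source ASCII-only. An empty result means the user is not in the configuration, which is the "filter" half of the task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Plain-Java sketch of the lookup-and-enrich step; names are illustrative. */
final class Enricher {
    /** Returns the enriched record, or empty if the user is not in the config. */
    static Optional<String> enrich(String userId, String eventTime, String eventType,
                                   int productId, Map<String, String[]> users) {
        String[] info = users.get(userId);   // info = {userName, userAge}
        if (info == null) {
            return Optional.empty();         // unknown user: filtered out
        }
        return Optional.of("(" + userId + "," + eventTime + "," + eventType + ","
                + productId + "," + info[0] + "," + info[1] + ")");
    }

    public static void main(String[] args) {
        Map<String, String[]> users = new HashMap<>();
        users.put("user_2", new String[] {"Li Si", "20"});
        users.put("user_3", new String[] {"Wang Wu", "30"});

        enrich("user_3", "2019-08-17 12:19:47", "browse", 1, users)
                .ifPresent(System.out::println);
        enrich("user_9", "2019-08-17 12:19:49", "click", 2, users)
                .ifPresent(System.out::println); // prints nothing: user_9 is unknown
    }
}
```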

Code:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.state.BroadcastState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple6;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * @author liu a fu
 * @version 1.0
 * @date 2021/8/5 0005
 * @DESC   Code demo of BroadcastState
 */
public class BroadcastStateDemo {
    public static void main(String[] args) throws Exception {
        //TODO:1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
        env.setParallelism(1);


        //TODO 2.source
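        //-1. Build the real-time event stream from a custom random source (high volume, generated continuously)
        //<userID, eventTime, eventType, productID>
        DataStreamSource<Tuple4<String, String, String, Integer>> evenDS = env.addSource(new MySource());

        //-2. Build the config stream from MySQL (low volume, mostly static data)
        //<userID, <userName, userAge>>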
        DataStreamSource<Map<String, Tuple2<String, Integer>>> userDS = env.addSource(new MySQLSource());


        //TODO: 3.transformation
        //-1. Define the state descriptor
        MapStateDescriptor<Void, Map<String, Tuple2<String, Integer>>> descriptor = new MapStateDescriptor<>("info", Types.VOID,
                Types.MAP(Types.STRING, Types.TUPLE(Types.STRING, Types.INT)));

        //-2. Broadcast the config stream
        BroadcastStream<Map<String, Tuple2<String, Integer>>> broadcastDS = userDS.broadcast(descriptor);

        //-3. Connect the event stream with the broadcast stream
        BroadcastConnectedStream<Tuple4<String, String, String, Integer>, Map<String, Tuple2<String, Integer>>> connectDS = evenDS.connect(broadcastDS);

        //-4. Process the connected stream: enrich each event with the user info from the config stream
        /**
         *  * @param <IN1> The input type of the non-broadcast side.
         *  * @param <IN2> The input type of the broadcast side.
         *  * @param <OUT> The output type of the operator.
         */
        SingleOutputStreamOperator<Tuple6<String, String, String, Integer, String, Integer>> resultDS = connectDS.process(new BroadcastProcessFunction
                //<userID, eventTime, eventType, productID> // event stream
                <Tuple4<String, String, String, Integer>,

                        //<userID, <userName, userAge>> // broadcast stream
                        Map<String, Tuple2<String, Integer>>,

                        //<userID, eventTime, eventType, productID, userName, userAge> // result stream: the records to emit
                        Tuple6<String, String, String, Integer, String, Integer>>() {

            // Process each element of the event stream
            @Override
            public void processElement(Tuple4<String, String, String, Integer> value,
                                       ReadOnlyContext ctx,
                                       Collector<Tuple6<String, String, String, Integer, String, Integer>> out) throws Exception {
                // value is one event: <userID, eventTime, eventType, productID>
                // Goal: join it with the broadcast data <userID, <userName, userAge>>
                // and emit <userID, eventTime, eventType, productID, userName, userAge>

                ReadOnlyBroadcastState<Void, Map<String, Tuple2<String, Integer>>> broadcastState = ctx.getBroadcastState(descriptor);
                // userID -> <userName, userAge>
                Map<String, Tuple2<String, Integer>> map = broadcastState.get(null);
                if (null != map) {
                    // Look up the user info in the map by the event's userID
                    String userId = value.f0;
                    Tuple2<String, Integer> tuple2 = map.get(userId);
                    // Skip users that are not in the config; this avoids a NullPointerException
                    if (null != tuple2) {
                        String username = tuple2.f0;
                        Integer age = tuple2.f1;

                        // Emit the enriched record
                        out.collect(Tuple6.of(userId, value.f1, value.f2, value.f3, username, age));
                    }
                }
            }

            // Process each element of the broadcast stream and update the state
            @Override
            public void processBroadcastElement(Map<String, Tuple2<String, Integer>> value,
                                                Context ctx,
                                                Collector<Tuple6<String, String, String, Integer, String, Integer>> out) throws Exception {
                // value is the latest data, queried from MySQL every 5 s and broadcast to this task
                // Store the latest data in the state
                BroadcastState<Void, Map<String, Tuple2<String, Integer>>> broadcastState = ctx.getBroadcastState(descriptor);
                broadcastState.clear(); // remove the old data
                broadcastState.put(null, value); // store the new data
            }
        });

        //TODO: 4.sink
        resultDS.print();
        //TODO:5.execute
        env.execute();

    }


    //TODO: test data sources, defined as static nested classes
    /**
     * Random event stream (high volume)
     * user ID, event time, event type, product ID
     * <userID, eventTime, eventType, productID>
     */
    public static class MySource implements SourceFunction<Tuple4<String, String, String, Integer>> {

        // volatile: cancel() is called from a different thread than run()
        private volatile boolean isRunning = true;

        @Override
        public void run(SourceContext<Tuple4<String, String, String, Integer>> ctx) throws Exception {
            Random random = new Random();
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            while (isRunning){
                int id = random.nextInt(4) + 1;
                String user_id = "user_" + id;
                String eventTime = df.format(new Date());
                String eventType = "type_" + random.nextInt(3);
                int productId = random.nextInt(4);
                ctx.collect(Tuple4.of(user_id,eventTime,eventType,productId));
                Thread.sleep(500);  // emit one record every 0.5 s
            }
        }

        // Stop the generation loop
        @Override
        public void cancel() {
            isRunning = false;
        }


    }


    /**
     * Config stream / rule stream / user-info stream (low volume)
     * <userID, <userName, userAge>>
     */
    /*
CREATE TABLE `user_info` (
  `userID` varchar(20) NOT NULL,
  `userName` varchar(10) DEFAULT NULL,
  `userAge` int(11) DEFAULT NULL,
  PRIMARY KEY (`userID`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

INSERT INTO `user_info` VALUES ('user_1', '张三', 10);
INSERT INTO `user_info` VALUES ('user_2', '李四', 20);
INSERT INTO `user_info` VALUES ('user_3', '王五', 30);
INSERT INTO `user_info` VALUES ('user_4', '赵六', 40);
     */
    public static class MySQLSource extends RichSourceFunction<Map<String, Tuple2<String, Integer>>> {
        // volatile: cancel() is called from a different thread than run()
        private volatile boolean flag = true;
        private Connection conn = null;
        private PreparedStatement ps = null;

        // open() is a good place to open the connection
        @Override
        public void open(Configuration parameters) throws Exception {
            conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata", "root", "root");
            String sql = "select `userID`, `userName`, `userAge` from `user_info`";
            ps = conn.prepareStatement(sql);
        }


        @Override
        public void run(SourceContext<Map<String, Tuple2<String, Integer>>> ctx) throws Exception {
            while (flag){
                Map<String, Tuple2<String, Integer>> map = new HashMap<>();
                // try-with-resources closes the ResultSet after every poll
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()){
                        String userID = rs.getString("userID");
                        String userName = rs.getString("userName");
                        int userAge = rs.getInt("userAge");
                        //Map<String, Tuple2<String, Integer>>
                        map.put(userID, Tuple2.of(userName,userAge));
                    }
                }
                ctx.collect(map);
                Thread.sleep(5000); // refresh the user config every 5 s
            }
        }

        // cancel() is called when the job is cancelled; it stops the polling loop
        @Override
        public void cancel() {
            flag = false;
        }

        // close() is a good place to release resources: statement first, then connection
        @Override
        public void close() throws Exception {
            if (ps != null) ps.close();
            if (conn != null) conn.close();
        }
    }


}


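Notes:

  1. Broadcast State is of Map type, that is, a key-value structure.
  2. Broadcast State can be modified only on the broadcast side, that is, in the processBroadcastElement method of BroadcastProcessFunction or KeyedBroadcastProcessFunction. On the non-broadcast side, that is, in the processElement method of BroadcastProcessFunction or KeyedBroadcastProcessFunction, it is read-only.
  3. The order in which broadcast elements arrive may differ from task to task, so be careful with any processing that depends on that order.
  4. During a checkpoint, every task checkpoints its own copy of the Broadcast State.
  5. Broadcast State is kept in memory at runtime; it currently cannot be stored in the RocksDB State Backend.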
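
Note 3 is easy to overlook, so here is a small plain-Java illustration. It is not Flink code; the class `OrderSensitivity`, its methods, and the idea of attaching a version number to each update are assumptions made for this sketch. Two simulated tasks receive the same two updates in different orders: a last-write-wins update ends in different states, while keeping the highest version converges to the same state.

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration of note 3; not Flink code. */
final class OrderSensitivity {
    /** Each update is {version, value}. Last-write-wins depends on arrival order. */
    static int lastWriteWins(List<int[]> updates) {
        int value = -1;
        for (int[] u : updates) {
            value = u[1];
        }
        return value;
    }

    /** Keeping the highest version gives the same result for any arrival order. */
    static int highestVersionWins(List<int[]> updates) {
        int bestVersion = Integer.MIN_VALUE;
        int value = -1;
        for (int[] u : updates) {
            if (u[0] > bestVersion) {
                bestVersion = u[0];
                value = u[1];
            }
        }
        return value;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 100};
        int[] v2 = {2, 200};
        List<int[]> taskA = Arrays.asList(v1, v2); // updates arrive in order
        List<int[]> taskB = Arrays.asList(v2, v1); // same updates, reordered

        System.out.println(lastWriteWins(taskA) + " vs " + lastWriteWins(taskB));           // 200 vs 100
        System.out.println(highestVersionWins(taskA) + " vs " + highestVersionWins(taskB)); // 200 vs 200
    }
}
```

Any update logic that yields the same state regardless of arrival order would do; versioning is just one option.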