Customizing the flink-kafka-connector

Background: we need to consume Kafka data between a specific start position and a specific end position, and the Flink job should stop on its own once it has consumed up to the end position. However, a batch job cannot read Kafka data out of the box, and a streaming job never stops by itself, so a custom Kafka connector is required. In Flink 1.14, the DataStream connector has a setBounded option that sets the end position of consumption, but the Table API does not yet support an end position, so we can modify the connector source code by following the DataStream approach.
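
For comparison, the Flink 1.14 DataStream connector already supports this through setBounded. A minimal sketch (broker address, topic and group id are placeholders; KafkaSource and OffsetsInitializer come from the flink-connector-kafka module):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("test02")
        .setGroupId("testGroup")
        .setStartingOffsets(OffsetsInitializer.earliest())
        // setBounded marks the source as BOUNDED and records the stopping offsets,
        // so the job finishes once that position is reached
        .setBounded(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
env.execute();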

This post is largely based on another article that helped me a lot; on top of it I changed a few details, two of which are key to meeting the requirement. First, the stream/batch check is adjusted so that a batch job can actually consume Kafka data; second, the job stops once it has consumed up to the specified position. Finally the connector is repackaged; watch the code format during packaging, and a flaky network can also cause trouble. The packaged jar is attached below and can be used directly.
flink-connector-kafka_2.11-1.14.4.jar

Main modifications

1. Process stream data in batch mode
In KafkaSourceBuilder, set the bounded flag Boundedness.BOUNDED to make sure a batch job can process the Kafka (stream) data. The flag can also be set elsewhere; put it wherever fits your needs.
2. Set the end offsets, mirroring how the start offsets are set
Add a setEndOffsets method to KafkaSourceBuilder and assign the stoppingOffsetsInitializer field in it.
3. Ways to specify the end position
In KafkaConnectorOptionsUtil, add a getEndupOptions method modeled on the existing getStartupOptions and adapt it for the end offsets; it again mirrors the startup handling. The detailed changes are listed below.

Notes:

Watch the code format: no extra line breaks or spaces, otherwise the build will not pass the checks during packaging. Also be careful with imports and avoid IDEA collapsing them into wildcard imports, which also breaks the build.
Steps:
1. File -> Settings -> Editor -> Code Style -> Java -> Imports
2. Set "Class count to use import with '*'" to 100
3. Set "Names count to use static import with '*'" to 100
After that the custom jar builds successfully. Replace the flink-connector-kafka jar in your local project with it, making sure the file name matches exactly, so that subsequent runs use the modified jar.

Detailed modifications:

Download the Flink source code (a Chinese GitHub mirror is usually faster). Seven places need to be modified:

1. KafkaSourceBuilder

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/KafkaSourceBuilder.java


public KafkaSourceBuilder<OUT> setEndOffsets(OffsetsInitializer stoppingOffsetsInitializer) {		// set the end offsets here; this is the core of the whole modification
    this.stoppingOffsetsInitializer = stoppingOffsetsInitializer;
    this.boundedness = Boundedness.BOUNDED; // mark the source as bounded so a batch job can process the Kafka (stream) data
    return this;
}

Mirroring the start offsets, this sets the end offsets. The reason for setting this.boundedness = Boundedness.BOUNDED here is that a batch job does not allow consuming an unbounded Kafka (stream) source; without it you get the following error:

Querying an unbounded table '%s' in batch mode is not allowed. The table source is unbounded.

The boundedness can also be set in a different place; adjust this to your own needs.
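
If you would rather not force the boundedness inside setEndOffsets, one possible variant (my own sketch, not part of the original change) is to keep setEndOffsets as a plain setter and decide the boundedness where the KafkaSource is assembled, using the existing setBounded:

// sketch: only switch to batch semantics when an end position was actually configured
if (stoppingOffsetsInitializer != null) {
    // bounded source: a batch job finishes once the end offsets are reached
    kafkaSourceBuilder.setBounded(stoppingOffsetsInitializer);
}
// otherwise keep the default CONTINUOUS_UNBOUNDED behaviour for streaming jobs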

2. EndupMode

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/config/EndupMode.java

Add a new EndupMode enum, again written by analogy with the startup counterpart:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.streaming.connectors.kafka.config;

import org.apache.flink.annotation.Internal;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartitionStateSentinel;

/** End modes for the Kafka Consumer. */
@Internal
public enum EndupMode {

    /** End from committed offsets in ZK / Kafka brokers of a specific consumer group (default). */
    GROUP_OFFSETS(KafkaTopicPartitionStateSentinel.GROUP_OFFSET),

    /** End from the latest offset. */
    LATEST(KafkaTopicPartitionStateSentinel.LATEST_OFFSET),

    /**
     * End at the user-supplied timestamp for each partition. Since this mode will have specific
     * offsets to end at, we do not need a sentinel value; using Long.MIN_VALUE as a
     * placeholder.
     */
    TIMESTAMP(Long.MIN_VALUE),

    /**
     * End at user-supplied specific offsets for each partition. Since this mode will have
     * specific offsets to end at, we do not need a sentinel value; using Long.MIN_VALUE as a
     * placeholder.
     */
    SPECIFIC_OFFSETS(Long.MIN_VALUE);

    /** The sentinel offset value corresponding to this end mode. */
    private long stateSentinel;

    EndupMode(long stateSentinel) {
        this.stateSentinel = stateSentinel;
    }
}

3. KafkaConnectorOptions

Define the options that control where Kafka consumption ends.

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/table/KafkaConnectorOptions.java


The end options are written by analogy with the startup options. This part is important, since it determines how the options are referenced later on.

Option | Value
scan.startup.mode | one of 'earliest-offset', 'latest-offset', 'group-offsets', 'timestamp', 'specific-offsets'
scan.startup.specific-offsets | offsets per partition, e.g. 'partition:0,offset:42;partition:1,offset:300'
scan.startup.timestamp-millis | the start timestamp, as a long
scan.endup.mode | one of 'latest-offset', 'group-offsets', 'timestamp', 'specific-offsets'
scan.endup.specific-offsets | offsets per partition, e.g. 'partition:0,offset:42;partition:1,offset:300'
scan.endup.timestamp-millis | the end timestamp, as a long

public static final ConfigOption<ScanEndupMode> SCAN_ENDUP_MODE =
        ConfigOptions.key("scan.endup.mode")
                .enumType(ScanEndupMode.class)
                .defaultValue(ScanEndupMode.GROUP_OFFSETS)
                .withDescription("Endup mode for Kafka consumer.");

public static final ConfigOption<String> SCAN_ENDUP_SPECIFIC_OFFSETS =
        ConfigOptions.key("scan.endup.specific-offsets")
                .stringType()
                .noDefaultValue()
                .withDescription(
                        "Optional offsets used in case of \"specific-offsets\" endup mode");

public static final ConfigOption<Long> SCAN_ENDUP_TIMESTAMP_MILLIS =
        ConfigOptions.key("scan.endup.timestamp-millis")
                .longType()
                .noDefaultValue()
                .withDescription("Optional timestamp used in case of \"timestamp\" endup mode");
/** Endup mode for the Kafka consumer, see {@link #SCAN_ENDUP_MODE}. */
public enum ScanEndupMode implements DescribedEnum {
    LATEST_OFFSET("latest-offset", text("End from the latest offset.")),
    GROUP_OFFSETS(
            "group-offsets",
            text(
                    "End from committed offsets in ZooKeeper / Kafka brokers of a specific consumer group.")),
    TIMESTAMP("timestamp", text("End from user-supplied timestamp for each partition.")),
    SPECIFIC_OFFSETS(
            "specific-offsets",
            text("End from user-supplied specific offsets for each partition."));
    private final String value;
    private final InlineElement description;

    ScanEndupMode(String value, InlineElement description) {
        this.value = value;
        this.description = description;
    }
    @Override
    public String toString() {
        return value;
    }
    @Override
    public InlineElement getDescription() {
        return description;
    }
}

4. KafkaConnectorOptionsUtil

Build the end-offset related objects for the Kafka consumer from the table options.

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/table/KafkaConnectorOptionsUtil.java


public static EndupOptions getEndupOptions(ReadableConfig tableOptions) {
    final Map<KafkaTopicPartition, Long> specificOffsets = new HashMap<>();
    final EndupMode endupMode =
            tableOptions
                    .getOptional(KafkaConnectorOptions.SCAN_ENDUP_MODE)
                    .map(KafkaConnectorOptionsUtil::endfromOption)
                    .orElse(EndupMode.GROUP_OFFSETS);
    // note: a dedicated helper is needed here to build the specific end offsets (buildSpecificEndOffsets below)
    if (endupMode == EndupMode.SPECIFIC_OFFSETS) {
        buildSpecificEndOffsets(tableOptions, tableOptions.get(TOPIC).get(0), specificOffsets);
    }
    final EndupOptions options = new EndupOptions();
    options.endupMode = endupMode;
    options.specificOffsets = specificOffsets;
    if (endupMode == EndupMode.TIMESTAMP) {
        options.endupTimestampMillis = tableOptions.get(SCAN_ENDUP_TIMESTAMP_MILLIS);
    }
    return options;
}
private static void buildSpecificEndOffsets(
        ReadableConfig tableOptions,
        String topic,
        Map<KafkaTopicPartition, Long> specificOffsets) {
    String specificOffsetsStrOpt = tableOptions.get(SCAN_ENDUP_SPECIFIC_OFFSETS);
    final Map<Integer, Long> offsetMap =
            parseSpecificOffsets(specificOffsetsStrOpt, SCAN_ENDUP_SPECIFIC_OFFSETS.key());
    offsetMap.forEach(
            (partition, offset) -> {
                final KafkaTopicPartition topicPartition =
                        new KafkaTopicPartition(topic, partition);
                specificOffsets.put(topicPartition, offset);
            });
}

This is again written by analogy with the start offsets. That is easy to see in hindsight, but not understanding it at first cost me a lot of time. Follow buildSpecificOffsets: where the startup path uses SCAN_STARTUP_SPECIFIC_OFFSETS, the end path uses SCAN_ENDUP_SPECIFIC_OFFSETS, whose value is supplied when the table is created in the DDL. A short sketch of how the value is parsed follows.
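
For reference, the option value reuses the format of the startup variant and is parsed by the existing parseSpecificOffsets helper inside KafkaConnectorOptionsUtil; a small sketch (the literal value is just an example):

// value of 'scan.endup.specific-offsets' as written in the CREATE TABLE statement
String value = "partition:0,offset:42;partition:1,offset:300";
// existing helper in KafkaConnectorOptionsUtil; yields {0=42, 1=300}
Map<Integer, Long> offsetMap = parseSpecificOffsets(value, SCAN_ENDUP_SPECIFIC_OFFSETS.key());
// buildSpecificEndOffsets then re-keys this map by KafkaTopicPartition(topic, partition)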

private static EndupMode endfromOption(KafkaConnectorOptions.ScanEndupMode scanEndupMode) {
    switch (scanEndupMode) {
        case LATEST_OFFSET:
            return EndupMode.LATEST;
        case GROUP_OFFSETS:
            return EndupMode.GROUP_OFFSETS;
        case SPECIFIC_OFFSETS:
            return EndupMode.SPECIFIC_OFFSETS;
        case TIMESTAMP:
            return EndupMode.TIMESTAMP;

        default:
            throw new TableException(
                    "Unsupported endup mode. Validator should have checked that.");
    }
}
/** Kafka endup options. */
public static class EndupOptions {
    public EndupMode endupMode;
    public Map<KafkaTopicPartition, Long> specificOffsets;
    public long endupTimestampMillis;
}

5. KafkaDynamicSource

In line with the changes above, the method that creates the source must also be updated to pass the newly added parameters through.

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/table/KafkaDynamicSource.java

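The original screenshots are not reproduced here, so the following is a sketch of what the change amounts to (the field names endupMode, specificEndOffsets and endupTimestampMillis are my own, mirroring the startup fields): the constructor gains the new parameters, they are stored in fields, and createKafkaSource applies them through the new setEndOffsets, analogous to the existing startupMode switch.

// new fields, mirroring startupMode / specificStartupOffsets / startupTimestampMillis
protected final EndupMode endupMode;
protected final Map<KafkaTopicPartition, Long> specificEndOffsets;
protected final long endupTimestampMillis;

// inside createKafkaSource(...), after the existing switch over startupMode:
switch (endupMode) {
    case LATEST:
        kafkaSourceBuilder.setEndOffsets(OffsetsInitializer.latest());
        break;
    case GROUP_OFFSETS:
        kafkaSourceBuilder.setEndOffsets(OffsetsInitializer.committedOffsets());
        break;
    case SPECIFIC_OFFSETS:
        Map<TopicPartition, Long> endOffsets = new HashMap<>();
        specificEndOffsets.forEach(
                (tp, offset) ->
                        endOffsets.put(
                                new TopicPartition(tp.getTopic(), tp.getPartition()), offset));
        kafkaSourceBuilder.setEndOffsets(OffsetsInitializer.offsets(endOffsets));
        break;
    case TIMESTAMP:
        kafkaSourceBuilder.setEndOffsets(OffsetsInitializer.timestamp(endupTimestampMillis));
        break;
}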

6. KafkaDynamicTableFactory

Similarly, the newly added parameters must be included when the source object is created here.

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/table/KafkaDynamicTableFactory.java

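The screenshots are omitted as well; in essence the factory has to register the new options and forward the parsed end options to the source. A sketch, assuming the KafkaDynamicSource constructor was extended as in step 5:

// optionalOptions(): register the new options so the planner accepts them in the WITH clause
options.add(KafkaConnectorOptions.SCAN_ENDUP_MODE);
options.add(KafkaConnectorOptions.SCAN_ENDUP_SPECIFIC_OFFSETS);
options.add(KafkaConnectorOptions.SCAN_ENDUP_TIMESTAMP_MILLIS);

// createDynamicTableSource(): parse the end options next to the startup options
final KafkaConnectorOptionsUtil.EndupOptions endupOptions =
        KafkaConnectorOptionsUtil.getEndupOptions(tableOptions);

// then pass endupOptions.endupMode, endupOptions.specificOffsets and
// endupOptions.endupTimestampMillis into the extended KafkaDynamicSource constructor,
// right next to the corresponding startupOptions arguments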

7. UpsertKafkaDynamicTableFactory

Likewise, values for the newly added parameters must be supplied when the source object is created here.

flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/table/UpsertKafkaDynamicTableFactory.java

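This factory instantiates KafkaDynamicSource directly with fixed startup defaults, so its constructor call only needs values for the new parameters; a sketch with neutral defaults (the exact argument order depends on how the constructor was extended in step 5):

// in createDynamicTableSource(...): supply defaults for the new end parameters,
// analogous to the fixed startup defaults already passed for the startup side
EndupMode endupMode = EndupMode.GROUP_OFFSETS;                               // new: default end mode
Map<KafkaTopicPartition, Long> specificEndOffsets = Collections.emptyMap();  // new: no specific end offsets
long endupTimestampMillis = 0;                                               // new: unused end timestamp
// ...and pass these three values into the extended KafkaDynamicSource constructor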


That covers all the places that need to be modified. Note that some of the module's test files will fail during packaging and need matching adjustments as well.


The test code is as follows:

1. Create the execution environment

EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(streamExecutionEnvironment,settings);

2. Create the table

There are three ways to specify the end position; concrete examples follow (use one of them).

// 1. DDL that ends at the latest offset
//  'scan.endup.mode' = 'latest-offset',\n" +
String connectSql = "CREATE TABLE KafkaTable (\n" +
        "  `user_id` BIGINT,\n" +
        "  `item_id` BIGINT,\n" +
        "  `age` BIGINT\n" +
        ") WITH (\n" +
        "  'connector' = 'kafka',\n" +
        "  'topic' = 'test02',\n" +
        "  'properties.bootstrap.servers' = 'localhost:9092',\n" +
        "  'properties.group.id' = 'testGroup',\n" +
        "  'scan.startup.mode' = 'earliest-offset',\n" +
        "  'scan.endup.mode' = 'latest-offset',\n" +
        "  'format' = 'csv'\n" +
        ")";
// 2. DDL that ends at specific offsets
//  'scan.endup.mode' = 'specific-offsets',\n" +
//  'scan.endup.specific-offsets' = 'partition:0,offset:22',\n" +
String connectSql = "CREATE TABLE KafkaTable (\n" +
        "  `user_id` BIGINT,\n" +
        "  `item_id` BIGINT,\n" +
        "  `age` BIGINT\n" +
        ") WITH (\n" +
        "  'connector' = 'kafka',\n" +
        "  'topic' = 'test02',\n" +
        "  'properties.bootstrap.servers' = 'localhost:9092',\n" +
        "  'properties.group.id' = 'testGroup',\n" +
        "  'scan.startup.mode' = 'earliest-offset',\n" +
        "  'scan.endup.mode' = 'specific-offsets',\n" +
        "  'scan.endup.specific-offsets' = 'partition:0,offset:22',\n" +
        "  'format' = 'csv'\n" +
        ")";
// 3. DDL that ends at a specific timestamp
// "  'scan.endup.mode' = 'timestamp',\n" +
// "  'scan.endup.timestamp-millis' = '1648124880000',\n" +
String connectSql = "CREATE TABLE KafkaTable (\n" +
        "  `user_id` BIGINT,\n" +
        "  `item_id` BIGINT,\n" +
        "  `age` BIGINT\n" +
        ") WITH (\n" +
        "  'connector' = 'kafka',\n" +
        "  'topic' = 'test02',\n" +
        "  'properties.bootstrap.servers' = 'localhost:9092',\n" +
        "  'properties.group.id' = 'testGroup',\n" +
        "  'scan.startup.mode' = 'earliest-offset',\n" +
        "  'scan.endup.mode' = 'timestamp',\n" +
        "  'scan.endup.timestamp-millis' = '1648124880000',\n" +
        "  'format' = 'csv'\n" +
        ")";
// execute the DDL to create the table
streamTableEnvironment.executeSql(connectSql);

3. Query and output

// query
Table result = streamTableEnvironment.sqlQuery("select user_id, item_id, age from KafkaTable");
// convert the Table to a DataStream so the rows can be printed
DataStream<Row> rowDataStream = streamTableEnvironment.toDataStream(result);
// print the rows; without a sink the job graph would be empty and execute() would fail
rowDataStream.print();

streamExecutionEnvironment.execute();