How to Use the Kafka Connector

Introduction

Flink's Kafka connector provides the ability to read data from and write data to Kafka. Combined with the checkpoint mechanism, it implements exactly-once semantics, guaranteeing the accuracy of reads and writes while still committing offsets so that the current consumption position can be queried.

KafkaConsumer

Basic Usage

Flink uses the KafkaConsumer to read data from one or more Kafka topics and turn it into a data stream.
To use the Flink Kafka connector, add the following dependency to the project's pom.xml:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
    <version>1.4.2</version>
</dependency>

Note: To use the Kafka 0.9 dependency instead, change the artifactId to:

<artifactId>flink-connector-kafka-0.9_2.11</artifactId>

A data stream that reads from Kafka can then be obtained with the following code:

Properties properties = new Properties();
// Kafka broker addresses, comma separated
properties.setProperty("bootstrap.servers", "localhost:9092");
// ZooKeeper server addresses, comma separated (only required for Kafka 0.8)
properties.setProperty("zookeeper.connect", "localhost:2181");
// ID of the consumer group
properties.setProperty("group.id", "test");
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(kafkaConsumer);

The constructor of FlinkKafkaConsumer08 has the following signature:

public FlinkKafkaConsumer08(String topic, DeserializationSchema<T> valueDeserializer, Properties props)

Its three parameters are:
1. The name of the topic to read from; to consume from several topics, pass a List<String> of topic names instead (see Advanced Usage, 1. Multiple Topics).
2. The DeserializationSchema that defines how the records read from Kafka are deserialized (see Advanced Usage, 4. Custom Deserialization).
3. A Properties object holding the remaining configuration.

A complete usage example:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;

import javax.annotation.Nullable;

/**
 * A simple example that shows how to read from and write to Kafka. This will read String messages
 * from the input topic, parse them into a POJO type {@link KafkaEvent}, group by some key, and finally
 * perform a rolling addition on each key for which the results are written back to another topic.
 *
 * <p>This example also demonstrates using a watermark assigner to generate per-partition
 * watermarks directly in the Flink Kafka consumer. For demonstration purposes, it is assumed that
 * the String messages are formatted as a (word,frequency,timestamp) tuple.
 *
 * <p>Example usage:
 * 	--input-topic test-input --output-topic test-output --bootstrap.servers localhost:9092 --zookeeper.connect localhost:2181 --group.id myconsumer
 */
public class Kafka010Example {

	public static void main(String[] args) throws Exception {
		// parse input arguments
		final ParameterTool parameterTool = ParameterTool.fromArgs(args);

		if (parameterTool.getNumberOfParameters() < 5) {
			System.out.println("Missing parameters!\n" +
					"Usage: Kafka --input-topic <topic> --output-topic <topic> " +
					"--bootstrap.servers <kafka brokers> " +
					"--zookeeper.connect <zk quorum> --group.id <some id>");
			return;
		}

		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		env.getConfig().disableSysoutLogging();
		env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
		env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
		env.getConfig().setGlobalJobParameters(parameterTool); // make parameters available in the web interface
		env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

		DataStream<KafkaEvent> input = env
				.addSource(
					new FlinkKafkaConsumer010<>(
						parameterTool.getRequired("input-topic"),
						new KafkaEventSchema(),
						parameterTool.getProperties())
					.assignTimestampsAndWatermarks(new CustomWatermarkExtractor()))
				.keyBy("word")
				.map(new RollingAdditionMapper());

		input.addSink(
				new FlinkKafkaProducer010<>(
						parameterTool.getRequired("output-topic"),
						new KafkaEventSchema(),
						parameterTool.getProperties()));

		env.execute("Kafka 0.10 Example");
	}

	/**
	 * A {@link RichMapFunction} that continuously outputs the current total frequency count of a key.
	 * The current total count is keyed state managed by Flink.
	 */
	private static class RollingAdditionMapper extends RichMapFunction<KafkaEvent, KafkaEvent> {

		private static final long serialVersionUID = 1180234853172462378L;

		private transient ValueState<Integer> currentTotalCount;

		@Override
		public KafkaEvent map(KafkaEvent event) throws Exception {
			Integer totalCount = currentTotalCount.value();

			if (totalCount == null) {
				totalCount = 0;
			}
			totalCount += event.getFrequency();

			currentTotalCount.update(totalCount);

			return new KafkaEvent(event.getWord(), totalCount, event.getTimestamp());
		}

		@Override
		public void open(Configuration parameters) throws Exception {
			currentTotalCount = getRuntimeContext().getState(new ValueStateDescriptor<>("currentTotalCount", Integer.class));
		}
	}

	/**
	 * A custom {@link AssignerWithPeriodicWatermarks}, that simply assumes that the input stream
	 * records are strictly ascending.
	 *
	 * <p>Flink also ships some built-in convenience assigners, such as the
	 * {@link BoundedOutOfOrdernessTimestampExtractor} and {@link AscendingTimestampExtractor}
	 */
	private static class CustomWatermarkExtractor implements AssignerWithPeriodicWatermarks<KafkaEvent> {

		private static final long serialVersionUID = -742759155861320823L;

		private long currentTimestamp = Long.MIN_VALUE;

		@Override
		public long extractTimestamp(KafkaEvent event, long previousElementTimestamp) {
			// the inputs are assumed to be of format (message,timestamp)
			this.currentTimestamp = event.getTimestamp();
			return event.getTimestamp();
		}

		@Nullable
		@Override
		public Watermark getCurrentWatermark() {
			return new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
		}
	}
}

Advanced Usage

1. Multiple Topics

The FlinkKafkaConsumer can consume from several topics at the same time; pass a list of topic names to the constructor:

new FlinkKafkaConsumer08<>(java.util.Arrays.asList("topic1", "topic2", "topic3"), new SimpleStringSchema(), properties)

Alternatively, the topics can be given as a regular expression; combined with partition discovery (see below), this lets a running job automatically pick up newly created topics that match the pattern and consume from them. For example:

FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>(
    java.util.regex.Pattern.compile("test-topic-[0-9]"),
    new SimpleStringSchema(),
    properties);

Topic and partition discovery is disabled by default; it can be enabled by setting the following configuration key, which defines the discovery interval in milliseconds:

flink.partition-discovery.interval-millis
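
As a minimal sketch, the interval can be passed through the same Properties object that is handed to the consumer; the 10-second interval and the topic pattern below are only illustrative values:

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
// check for new matching topics/partitions every 10 seconds
properties.setProperty("flink.partition-discovery.interval-millis", "10000");

FlinkKafkaConsumer011<String> discoveringConsumer = new FlinkKafkaConsumer011<>(
    java.util.regex.Pattern.compile("test-topic-[0-9]"),
    new SimpleStringSchema(),
    properties);
DataStream<String> stream = env.addSource(discoveringConsumer);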

2. Configuring the Start Offset

The FlinkKafkaConsumer lets a job start consuming from a configurable position in the topic. The following modes are available:
1. Start consuming from the earliest offset of every partition:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromEarliest();

2. Start consuming from the latest offset of every partition:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromLatest();

3. Start consuming from the offsets committed for the consumer group in each partition (the default behaviour); this requires that a group.id has been set on the consumer:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromGroupOffsets();

4. Start consuming each partition from an explicitly specified offset:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
specificStartOffsets.put(new KafkaTopicPartition("topic", 0), 23L);
specificStartOffsets.put(new KafkaTopicPartition("topic", 1), 31L);
specificStartOffsets.put(new KafkaTopicPartition("topic", 2), 43L);

kafkaConsumer.setStartFromSpecificOffsets(specificStartOffsets);

3. Checkpointing and Offset Commit Modes

To guarantee exactly-once processing, enable checkpointing for the job as follows:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5000 msecs

1. With checkpointing enabled, the KafkaConsumer commits the offsets of the consumed records to ZooKeeper (Kafka 0.8) or to the Kafka brokers (Kafka 0.9+) each time a checkpoint completes.
2. With checkpointing enabled, offset committing can be switched off as follows, in which case no offsets are committed when a checkpoint completes:

kafkaConsumer.setCommitOffsetsOnCheckpoints(false);

3. When checkpointing is not enabled, the KafkaConsumer periodically commits the offsets of the consumed records. The commit interval can be configured as follows (the default is 60 s):

Properties properties = new Properties();
// commit interval in milliseconds (60000 ms = 60 s)
properties.setProperty("auto.commit.interval.ms", "60000");

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

4. Periodic offset committing can be turned off with the following property; the KafkaConsumer then commits no offsets at all while checkpointing is not enabled:

Properties properties = new Properties();
// for the Kafka 0.9+ consumer the equivalent property is "enable.auto.commit"
properties.setProperty("auto.commit.enable", "false");

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

4. Custom Deserialization

To consume data in other formats from Kafka, the kafkaConsumer needs a DeserializationSchema that turns each consumed byte[] into the corresponding Java/Scala object. The examples above all use SimpleStringSchema, which deserializes every record into a String. A custom deserialization can be set up in the following ways:
1. Use TypeInformationSerializationSchema, provided by Flink and constructed from a TypeInformation, to deserialize data types natively supported by Flink. For example:

TypeInformationSerializationSchema<Tuple2<String, String>> serializationSchema =
        new TypeInformationSerializationSchema<>(
                TypeInformation.of(new TypeHint<Tuple2<String, String>>() {}),
                env.getConfig());
FlinkKafkaConsumer08<Tuple2<String, String>> kafkaConsumer =
        new FlinkKafkaConsumer08<>("topic", serializationSchema, properties);
DataStream<Tuple2<String, String>> stream = env.addSource(kafkaConsumer);

2. Use JSONDeserializationSchema to deserialize JSON-formatted data into Jackson ObjectNode objects:

JSONDeserializationSchema deserializationSchema = new JSONDeserializationSchema();
FlinkKafkaConsumer08<ObjectNode> kafkaConsumer =
        new FlinkKafkaConsumer08<>("topic", deserializationSchema, properties);
DataStream<ObjectNode> stream = env.addSource(kafkaConsumer);
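
For instance, each field of the resulting ObjectNode can then be read downstream; a brief sketch (the "word" field name is only an assumed example):

DataStream<String> words = stream.map(new MapFunction<ObjectNode, String>() {
    @Override
    public String map(ObjectNode node) {
        // extract a single JSON field as text
        return node.get("word").asText();
    }
});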

3. Extend the AbstractDeserializationSchema abstract class to build a user-defined deserializer. The two relevant methods are:

public abstract T deserialize(byte[] message) throws IOException;
public TypeInformation<T> getProducedType();

Here, deserialize() performs the actual deserialization and must be implemented; getProducedType() reports the type of the deserialized objects and already has a default implementation in AbstractDeserializationSchema, so it usually does not need to be overridden.
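
As an illustration, here is a minimal sketch of a hypothetical KafkaEventDeserializationSchema for the KafkaEvent type used in the example above. It assumes a KafkaEvent(String word, int frequency, long timestamp) constructor and a comma-separated wire format (both are assumptions, not part of Flink), and it also shows the "skip bad records" option described in the note below; depending on the Flink version, AbstractDeserializationSchema may live in a different package.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

public class KafkaEventDeserializationSchema extends AbstractDeserializationSchema<KafkaEvent> {

    private static final long serialVersionUID = 1L;

    @Override
    public KafkaEvent deserialize(byte[] message) throws IOException {
        try {
            // records are assumed to arrive as "word,frequency,timestamp" strings
            String[] fields = new String(message, StandardCharsets.UTF_8).split(",");
            return new KafkaEvent(fields[0], Integer.parseInt(fields[1]), Long.parseLong(fields[2]));
        } catch (RuntimeException e) {
            // returning null tells the Flink Kafka consumer to skip the corrupted record
            // (option 2 in the note below)
            return null;
        }
    }
}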

Note: When a record cannot be deserialized, there are two options:
1. Throw an exception from the deserialization method, which fails the whole job and triggers a restart.
2. Return null to skip the record and continue with the next one (as in the sketch above). The job keeps running, but skipping records may affect the accuracy of the results.