How to Use the Kafka Connector
Introduction
Flink's Kafka connector provides the ability to read data from and write data to Kafka, and uses Flink's checkpoint mechanism to achieve exactly-once processing semantics, guaranteeing the accuracy of the data that is read and written while keeping the corresponding offset information available for inspection.
KafkaConsumer
Basic Usage
Flink reads data from one (or more) Kafka topics through a KafkaConsumer and turns it into a data stream.
To use the Flink Kafka connector, add the following dependency to the project's pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
    <version>1.4.2</version>
</dependency>
Note: to use the Kafka 0.9 connector instead, change the artifactId to:
<artifactId>flink-connector-kafka-0.9_2.11</artifactId>
The following code then creates a data stream that reads from Kafka:
Properties properties = new Properties();
// Kafka broker addresses, comma-separated
properties.setProperty("bootstrap.servers", "localhost:9092");
// ZooKeeper server addresses, comma-separated (required by the Kafka 0.8 consumer)
properties.setProperty("zookeeper.connect", "localhost:2181");
// consumer group ID used when reading
properties.setProperty("group.id", "test");
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(kafkaConsumer);
The constructor signature of FlinkKafkaConsumer08 is:
public FlinkKafkaConsumer08(String topic, DeserializationSchema<T> valueDeserializer, Properties props)
Its three parameters are:
1. The name of the topic to read from (an overloaded constructor also accepts a list of topic names; see Advanced Usage, section 1);
2. The deserialization schema that defines how records read from Kafka are turned into objects (see Advanced Usage, section 4, Custom deserialization);
3. A Properties object carrying the remaining configuration.
A complete example:
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import javax.annotation.Nullable;
/**
 * A simple example that shows how to read from and write to Kafka. This will read String messages
 * from the input topic, parse them into a POJO type {@link KafkaEvent}, group by some key, and finally
 * perform a rolling addition on each key for which the results are written back to another topic.
 *
 * <p>This example also demonstrates using a watermark assigner to generate per-partition
 * watermarks directly in the Flink Kafka consumer. For demonstration purposes, it is assumed that
 * the String messages are formatted as a (word,frequency,timestamp) tuple.
 *
 * <p>Example usage:
 * --input-topic test-input --output-topic test-output --bootstrap.servers localhost:9092 --zookeeper.connect localhost:2181 --group.id myconsumer
 */
public class Kafka010Example {

    public static void main(String[] args) throws Exception {
        // parse input arguments
        final ParameterTool parameterTool = ParameterTool.fromArgs(args);

        if (parameterTool.getNumberOfParameters() < 5) {
            System.out.println("Missing parameters!\n" +
                    "Usage: Kafka --input-topic <topic> --output-topic <topic> " +
                    "--bootstrap.servers <kafka brokers> " +
                    "--zookeeper.connect <zk quorum> --group.id <some id>");
            return;
        }

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().disableSysoutLogging();
        env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
        env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
        env.getConfig().setGlobalJobParameters(parameterTool); // make parameters available in the web interface
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<KafkaEvent> input = env
                .addSource(
                    new FlinkKafkaConsumer010<>(
                        parameterTool.getRequired("input-topic"),
                        new KafkaEventSchema(),
                        parameterTool.getProperties())
                    .assignTimestampsAndWatermarks(new CustomWatermarkExtractor()))
                .keyBy("word")
                .map(new RollingAdditionMapper());

        input.addSink(
                new FlinkKafkaProducer010<>(
                    parameterTool.getRequired("output-topic"),
                    new KafkaEventSchema(),
                    parameterTool.getProperties()));

        env.execute("Kafka 0.10 Example");
    }

    /**
     * A {@link RichMapFunction} that continuously outputs the current total frequency count of a key.
     * The current total count is keyed state managed by Flink.
     */
    private static class RollingAdditionMapper extends RichMapFunction<KafkaEvent, KafkaEvent> {

        private static final long serialVersionUID = 1180234853172462378L;

        private transient ValueState<Integer> currentTotalCount;

        @Override
        public KafkaEvent map(KafkaEvent event) throws Exception {
            Integer totalCount = currentTotalCount.value();
            if (totalCount == null) {
                totalCount = 0;
            }
            totalCount += event.getFrequency();

            currentTotalCount.update(totalCount);

            return new KafkaEvent(event.getWord(), totalCount, event.getTimestamp());
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            currentTotalCount = getRuntimeContext().getState(new ValueStateDescriptor<>("currentTotalCount", Integer.class));
        }
    }

    /**
     * A custom {@link AssignerWithPeriodicWatermarks}, that simply assumes that the input stream
     * records are strictly ascending.
     *
     * <p>Flink also ships some built-in convenience assigners, such as the
     * {@link BoundedOutOfOrdernessTimestampExtractor} and {@link AscendingTimestampExtractor}
     */
    private static class CustomWatermarkExtractor implements AssignerWithPeriodicWatermarks<KafkaEvent> {

        private static final long serialVersionUID = -742759155861320823L;

        private long currentTimestamp = Long.MIN_VALUE;

        @Override
        public long extractTimestamp(KafkaEvent event, long previousElementTimestamp) {
            // the inputs are assumed to be of format (message,timestamp)
            this.currentTimestamp = event.getTimestamp();
            return event.getTimestamp();
        }

        @Nullable
        @Override
        public Watermark getCurrentWatermark() {
            return new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
        }
    }
}
Advanced Usage
1. Multiple topic support
FlinkKafkaConsumer can consume from several topics at the same time; pass a list of topic names to the constructor:
new FlinkKafkaConsumer08<>(Arrays.asList("topic1", "topic2", "topic3"), new SimpleStringSchema(), properties)
Alternatively, multiple topics can be matched with a regular expression; combined with topic discovery (see below), the job can then automatically pick up newly created topics that match the pattern and consume from them. For example:
FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>(
    java.util.regex.Pattern.compile("test-topic-[0-9]"),
    new SimpleStringSchema(),
    properties);
Topic and partition discovery is controlled by the following property: once it is set to a non-negative interval (in milliseconds), the consumer periodically checks for newly created matching topics and partitions and starts consuming from them:
flink.partition-discovery.interval-millis
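For example, discovery can be enabled directly through the consumer Properties. This is only a minimal sketch; the 30-second interval is chosen purely for illustration:
// check for newly created matching topics and new partitions every 30 seconds (illustrative value)
properties.setProperty("flink.partition-discovery.interval-millis", "30000");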
2. Configuring the start offset
FlinkKafkaConsumer can start consuming a topic from a configurable position. The following start positions are supported:
1. Start from the earliest offset of each partition:
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromEarliest();
2. Start from the latest offset of each partition:
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromLatest();
3. Start from the committed group offsets of each partition (the default behaviour); this requires a group.id to be set in the consumer properties:
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromGroupOffsets();
4. Start each partition from an explicitly specified offset:
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
specificStartOffsets.put(new KafkaTopicPartition("topic", 0), 23L);
specificStartOffsets.put(new KafkaTopicPartition("topic", 1), 31L);
specificStartOffsets.put(new KafkaTopicPartition("topic", 2), 43L);
kafkaConsumer.setStartFromSpecificOffsets(specificStartOffsets);
3. Checkpointing and offset commit modes
To guarantee exactly-once processing of the data, enable checkpointing for the job as follows:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5000 msecs
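If desired, the checkpointing mode can also be stated explicitly when enabling checkpointing. The sketch below reuses the env from above; EXACTLY_ONCE is already the default mode, so this is purely for explicitness:
import org.apache.flink.streaming.api.CheckpointingMode;

// checkpoint every 5 seconds and state the exactly-once mode explicitly (it is the default)
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);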
1. When checkpointing is enabled, the Kafka consumer commits the offsets of the consumed records to ZooKeeper (Kafka 0.8) or to the Kafka brokers (Kafka 0.9+) after each checkpoint completes.
2. With checkpointing enabled, offset committing can be switched off as follows; the consumer then no longer commits offsets when a checkpoint completes:
kafkaConsumer.setCommitOffsetsOnCheckpoints(false);
3. When checkpointing is disabled, the Kafka consumer periodically commits the offsets of the consumed records; the commit interval can be configured as follows (the default is 60 s):
Properties properties = new Properties();
// commit offsets every 60 seconds (the value is in milliseconds)
properties.setProperty("auto.commit.interval.ms", "60000");
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
4. Periodic committing can be disabled with the following property; if checkpointing is also disabled, the consumer then never commits offsets:
Properties properties = new Properties();
// disable periodic offset committing (Kafka 0.8 property name; for 0.9+ use "enable.auto.commit")
properties.setProperty("auto.commit.enable", "false");
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
4. Custom deserialization
To consume data in different formats from Kafka, the consumer needs a DeserializationSchema that turns each consumed byte[] array into the corresponding Java/Scala object. All of the examples above use SimpleStringSchema, which deserializes every record into a String. Custom deserialization can be configured in the following ways:
1. Use Flink's TypeInformationSerializationSchema, constructed from a TypeInformation, to deserialize types that Flink supports natively, for example:
TypeInformationSerializationSchema<Tuple2<String, String>> serializationSchema =
    new TypeInformationSerializationSchema<>(
        TypeInformation.of(new TypeHint<Tuple2<String, String>>() {}),
        env.getConfig());
FlinkKafkaConsumer08<Tuple2<String, String>> kafkaConsumer =
    new FlinkKafkaConsumer08<>("topic", serializationSchema, properties);
DataStream<Tuple2<String, String>> stream = env.addSource(kafkaConsumer);
2. Use JSONDeserializationSchema to deserialize JSON-formatted records into Jackson ObjectNode objects:
JSONDeserializationSchema deserializationSchema = new JSONDeserializationSchema();
FlinkKafkaConsumer08<ObjectNode> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", deserializationSchema, properties);
DataStream<ObjectNode> stream = env.addSource(kafkaConsumer);
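The fields of each ObjectNode can then be read with the usual Jackson accessors. The following sketch is illustrative only; it assumes the JSON records happen to contain a "word" field:
import org.apache.flink.api.common.functions.MapFunction;

// extract a hypothetical "word" field from every JSON record (the field name is an assumption)
DataStream<String> words = stream.map(new MapFunction<ObjectNode, String>() {
    @Override
    public String map(ObjectNode node) {
        return node.get("word").asText();
    }
});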
3. Extend the AbstractDeserializationSchema abstract class to build a fully custom deserializer around the following two methods:
public abstract T deserialize(byte[] message) throws IOException;
public TypeInformation<T> getProducedType();
Here, deserialize() performs the actual deserialization, while getProducedType() returns the type information of the produced objects (AbstractDeserializationSchema already supplies a default implementation of getProducedType(), so usually only deserialize() needs to be written). A short sketch is given after the note below.
Note: when a record cannot be deserialized, there are two ways to handle it:
1. Throw an exception from the deserialization method, which fails the whole job and triggers a restart;
2. Skip the record (for example, by returning null from deserialize()) and continue with the next one; the job keeps running, but the skipped data may affect the accuracy of the results.
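The sketch below combines item 3 above with option 2 of the note: a custom schema extending AbstractDeserializationSchema that parses each record into a WordCount POJO and returns null for malformed records so that they are skipped. The WordCount class and the assumed "word,frequency" message layout exist only for illustration, and the import path of AbstractDeserializationSchema may differ between Flink versions (older releases ship it in org.apache.flink.streaming.util.serialization):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.flink.streaming.util.serialization.AbstractDeserializationSchema;

// Hypothetical POJO used only for this sketch.
class WordCount {
    public String word;
    public int frequency;

    public WordCount() {}

    public WordCount(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }
}

// Custom schema: parses "word,frequency" messages and skips records it cannot parse.
public class WordCountDeserializationSchema extends AbstractDeserializationSchema<WordCount> {

    @Override
    public WordCount deserialize(byte[] message) throws IOException {
        String raw = new String(message, StandardCharsets.UTF_8);
        String[] parts = raw.split(",");
        if (parts.length != 2) {
            // returning null lets the Flink Kafka consumer skip the corrupted record
            return null;
        }
        try {
            return new WordCount(parts[0], Integer.parseInt(parts[1]));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
Such a schema could then be plugged into the consumer in the same way as the schemas above, e.g. new FlinkKafkaConsumer08<>("topic", new WordCountDeserializationSchema(), properties).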