1. Flink-Kafka two-phase commit source code analysis
   - TwoPhaseCommitSinkFunction analysis
2. Order in which notifyCheckpointComplete is called in Flink
   - Definition
   - Example
   - How operators call notifyCheckpointComplete
   - Impact on Exactly-Once semantics
Flink-Kafka two-phase commit source code analysis
FlinkKafkaProducer extends TwoPhaseCommitSinkFunction, i.e. it uses a two-phase commit. For the principle behind two-phase commit, see "An Overview of End-to-End Exactly-Once Processing in Apache Flink"; this article does not repeat that material. Instead, it analyzes how the FlinkKafkaProducer source code implements the two-phase commit and thereby achieves end-to-end Exactly-Once semantics when combined with Kafka.
https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
TwoPhaseCommitSinkFunction analysis
TwoPhaseCommitSinkFunction implements the CheckpointedFunction and CheckpointListener interfaces. The first step is to open a transaction in the initializeState method. For a Flink sink using two-phase commit, the first phase is the execution of CheckpointedFunction#snapshotState; once all tasks have completed their checkpoint, each task executes CheckpointListener#notifyCheckpointComplete, which is the second phase.
public abstract class TwoPhaseCommitSinkFunction<IN, TXN, CONTEXT>
extends RichSinkFunction<IN>
implements CheckpointedFunction, CheckpointListener
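To make the division of labor concrete, here is a minimal sketch of a user-defined sink built on TwoPhaseCommitSinkFunction (the class and transaction names, serializers and method bodies are illustrative assumptions, not Flink or Kafka connector code): the transaction is opened in beginTransaction, pre-committed in the first phase, and committed in the second.

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class SketchTwoPhaseSink extends TwoPhaseCommitSinkFunction<String, SketchTransaction, Void> {

    public SketchTwoPhaseSink() {
        // serializers for the transaction object and the (unused) user context
        super(new KryoSerializer<>(SketchTransaction.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected SketchTransaction beginTransaction() {
        // called from initializeState, and again after every snapshot, to open a fresh transaction
        return new SketchTransaction();
    }

    @Override
    protected void invoke(SketchTransaction transaction, String value, Context context) {
        // normal processing writes into the currently open transaction
        transaction.buffer.add(value);
    }

    @Override
    protected void preCommit(SketchTransaction transaction) {
        // phase 1 (snapshotState): flush so the data can later be committed without loss
    }

    @Override
    protected void commit(SketchTransaction transaction) {
        // phase 2 (notifyCheckpointComplete): make the buffered data visible to the outside
    }

    @Override
    protected void abort(SketchTransaction transaction) {
        // failure path: discard whatever this transaction buffered
        transaction.buffer.clear();
    }
}

class SketchTransaction {
    final java.util.List<String> buffer = new java.util.ArrayList<>();
}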
■ FlinkKafkaProducer: first-phase analysis
The core of this part is the following code:
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
// this is like the pre-commit of a 2-phase-commit transaction
// we are ready to commit and remember the transaction
checkState(currentTransactionHolder != null, "bug: no transaction object when performing state snapshot");
long checkpointId = context.getCheckpointId();
LOG.debug("{} - checkpoint {} triggered, flushing transaction '{}'", name(), context.getCheckpointId(), currentTransactionHolder);
preCommit(currentTransactionHolder.handle);
pendingCommitTransactions.put(checkpointId, currentTransactionHolder);
LOG.debug("{} - stored pending transactions {}", name(), pendingCommitTransactions);
currentTransactionHolder = beginTransactionInternal();
LOG.debug("{} - started new transaction '{}'", name(), currentTransactionHolder);
state.clear();
state.add(new State<>(
this.currentTransactionHolder,
new ArrayList<>(pendingCommitTransactions.values()),
userContext));
}
- preCommit is executed first. In EXACTLY_ONCE mode it calls flush, which immediately sends the records to the target topic. If you consume that topic at this point, you need to set isolation.level to read_committed so that the consuming application cannot see messages belonging to uncommitted transactions (a consumer configuration sketch follows this list).
Note that the first send and flush calls run in the transaction that was opened in the initializeState method.
@Override
protected void preCommit(FlinkKafkaProducer.KafkaTransactionState transaction) throws FlinkKafkaException {
switch (semantic) {
case EXACTLY_ONCE:
case AT_LEAST_ONCE:
flush(transaction);
break;
case NONE:
break;
default:
throw new UnsupportedOperationException("Not implemented semantic");
}
checkErroneous();
}
The actual send and flush calls are made on the producer attached to the current transaction:
transaction.producer.send(record, callback);
transaction.producer.flush();
- pendingCommitTransactions stores the transaction belonging to each checkpoint, and a new producer transaction is created for the next checkpoint via currentTransactionHolder = beginTransactionInternal(); the next round of send and flush calls runs inside that new transaction. In other words, in the first phase every checkpoint has its own transaction, and all of them are kept in pendingCommitTransactions.
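For the consumer side mentioned above, the following is a minimal sketch of a Kafka consumer that only reads committed data (the broker address, group id and deserializers are placeholder assumptions; the topic name is taken from the example job later in this article):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReadCommittedConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "read-committed-demo");     // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // only messages from committed transactions become visible to this consumer
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("bar"));
            consumer.poll(Duration.ofSeconds(1)).forEach(record -> System.out.println(record.value()));
        }
    }
}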
■ FlinkKafkaProducer: second-phase analysis
Once the checkpoint has been completed by all tasks, the second phase of the commit begins. In this phase all transactions stored in pendingCommitTransactions are committed.
@Override
public final void notifyCheckpointComplete(long checkpointId) throws Exception {
// the following scenarios are possible here
//
// (1) there is exactly one transaction from the latest checkpoint that
// was triggered and completed. That should be the common case.
// Simply commit that transaction in that case.
//
// (2) there are multiple pending transactions because one previous
// checkpoint was skipped. That is a rare case, but can happen
// for example when:
//
// - the master cannot persist the metadata of the last
// checkpoint (temporary outage in the storage system) but
// could persist a successive checkpoint (the one notified here)
//
// - other tasks could not persist their status during
// the previous checkpoint, but did not trigger a failure because they
// could hold onto their state and could successfully persist it in
// a successive checkpoint (the one notified here)
//
// In both cases, the prior checkpoint never reach a committed state, but
// this checkpoint is always expected to subsume the prior one and cover all
// changes since the last successful one. As a consequence, we need to commit
// all pending transactions.
//
// (3) Multiple transactions are pending, but the checkpoint complete notification
// relates not to the latest. That is possible, because notification messages
// can be delayed (in an extreme case till arrive after a succeeding checkpoint
// was triggered) and because there can be concurrent overlapping checkpoints
// (a new one is started before the previous fully finished).
//
// ==> There should never be a case where we have no pending transaction here
//
Iterator<Map.Entry<Long, TransactionHolder<TXN>>> pendingTransactionIterator = pendingCommitTransactions.entrySet().iterator();
checkState(pendingTransactionIterator.hasNext(), "checkpoint completed, but no transaction pending");
Throwable firstError = null;
while (pendingTransactionIterator.hasNext()) {
Map.Entry<Long, TransactionHolder<TXN>> entry = pendingTransactionIterator.next();
Long pendingTransactionCheckpointId = entry.getKey();
TransactionHolder<TXN> pendingTransaction = entry.getValue();
if (pendingTransactionCheckpointId > checkpointId) {
continue;
}
LOG.info("{} - checkpoint {} complete, committing transaction {} from checkpoint {}",
name(), checkpointId, pendingTransaction, pendingTransactionCheckpointId);
logWarningIfTimeoutAlmostReached(pendingTransaction);
try {
commit(pendingTransaction.handle);
} catch (Throwable t) {
if (firstError == null) {
firstError = t;
}
}
LOG.debug("{} - committed checkpoint transaction {}", name(), pendingTransaction);
pendingTransactionIterator.remove();
}
if (firstError != null) {
throw new FlinkRuntimeException("Committing one of transactions failed, logging first encountered failure",
firstError);
}
}
At this point a consumer using read_committed can finally see the data, and the producer flow is complete.
@Override
protected void commit(FlinkKafkaProducer.KafkaTransactionState transaction) {
if (transaction.isTransactional()) {
try {
transaction.producer.commitTransaction();
} finally {
recycleTransactionalProducer(transaction.producer);
}
}
}
■ Exactly-Once analysis
When both the input source and the output are Kafka, Flink achieves end-to-end Exactly-Once semantics mainly because, in the first phase, FlinkKafkaConsumer saves the consumed offsets through the checkpoint, and only after the whole checkpoint has succeeded does FlinkKafkaProducer commit its transaction in the second phase, finishing the producer flow. This process relies heavily on the transaction mechanism of the Kafka producer.

Order in which notifyCheckpointComplete is called in Flink
Definition
The notifyCheckpointComplete method is defined in the CheckpointListener interface.
Put simply, once a checkpoint has completed, the JobMaster notifies each task to execute this method; in FlinkKafkaProducer, for example, notifyCheckpointComplete is where the transaction commit happens.
/**
* This interface must be implemented by functions/operations that want to receive
* a commit notification once a checkpoint has been completely acknowledged by all
* participants.
*/
@PublicEvolving
public interface CheckpointListener {
/**
* This method is called as a notification once a distributed checkpoint has been completed.
*
* Note that any exception during this method will not cause the checkpoint to
* fail any more.
*
* @param checkpointId The ID of the checkpoint that has been completed.
* @throws Exception
*/
void notifyCheckpointComplete(long checkpointId) throws Exception;
}
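As an illustration of the pattern this interface enables, here is a minimal sketch (not actual Flink connector code; the class name and bodies are assumptions): side effects are buffered during normal processing and only acknowledged to the outside world once the JobMaster reports that the checkpoint has completed. FlinkKafkaConsumer follows the same pattern when it commits its offsets back to Kafka in notifyCheckpointComplete.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class AckOnCheckpointSinkSketch extends RichSinkFunction<String> implements CheckpointListener {

    // records seen since the last completed checkpoint (a real implementation would also
    // snapshot this buffer via CheckpointedFunction so it survives failures)
    private final List<String> pendingAcks = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        pendingAcks.add(value);
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // only now is it safe to tell the external system that the data is durable in Flink
        System.out.println("checkpoint " + checkpointId + " complete, acking " + pendingAcks.size() + " records");
        pendingAcks.clear();
    }
}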
Example
The program below is split into two tasks: task1 is Source: Example Source, and task2 is Map -> Sink: Example Sink.
DataStream<KafkaEvent> input = env.addSource(
new FlinkKafkaConsumer<>("foo", new KafkaEventSchema(), properties)
.assignTimestampsAndWatermarks(new CustomWatermarkExtractor())).name("Example Source")
.keyBy("word")
.map(new MapFunction<KafkaEvent, KafkaEvent>() {
@Override
public KafkaEvent map(KafkaEvent value) throws Exception {
value.setFrequency(value.getFrequency() + 1);
return value;
}
});
input.addSink(
new FlinkKafkaProducer<>(
"bar",
new KafkaSerializationSchemaImpl(),
properties,
FlinkKafkaProducer.Semantic.EXACTLY_ONCE)).name("Example Sink");
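One detail the snippet leaves implicit: the EXACTLY_ONCE semantic only takes effect when checkpointing is enabled on the environment, because the second-phase commit is driven by checkpoint completion. A minimal fragment of the corresponding setup, in the same style as the snippet above (the interval value is an assumption for illustration):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// trigger a checkpoint every 60 s; each checkpoint corresponds to one Kafka transaction in the sink
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);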
■ How operators call notifyCheckpointComplete
According to the example above, task1 contains only the source operator, while task2 contains two operators: map and sink. The task's notifyCheckpointComplete method is called in StreamTask, and the key part is:
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
boolean success = false;
synchronized (lock) {
if (isRunning) {
LOG.debug("Notification of complete checkpoint for task {}", getName());
for (StreamOperator<?> operator : operatorChain.getAllOperators()) {
if (operator != null) {
operator.notifyCheckpointComplete(checkpointId);
}
}
success = true;
}
else {
LOG.debug("Ignoring notification of complete checkpoint for not-running task {}", getName());
}
}
if (success) {
syncSavepointLatch.acknowledgeCheckpointAndTrigger(checkpointId, this::finishTask);
}
}
The order in which the operators are called is determined by the allOperators field; as the comment in the source code shows, the operators are stored in reverse order.
for (StreamOperator<?> operator : operatorChain.getAllOperators()) {
if (operator != null) {
operator.notifyCheckpointComplete(checkpointId);
}
}
In other words, although the client code above calls map before sink, at runtime the sink's notifyCheckpointComplete method is actually invoked first, followed by the map's.
/**
* Stores all operators on this chain in reverse order.
*/
private final StreamOperator<?>[] allOperators;
Impact on Exactly-Once semantics
In the example above, the source's notifyCheckpointComplete method is executed before the sink's. If the .keyBy("word") is removed, however, there is only one task, all of its operators are notified in reverse order, and the sink's notifyCheckpointComplete method is therefore called before the source's. To make the flow easier to follow, the rest of this article only considers a parallelism of 1 and ignores the case where some subtasks succeed while others fail. Tips: the discussion below assumes a Kafka source and a Kafka sink.

■ Sink first, then source

| Recovery method | Failure after the sink commit succeeds, before the source runs | Failure before the sink commit succeeds |
| Restore from checkpoint | exactly-once | data loss |
| Restore from __consumer_offsets | duplicate consumption | exactly-once |

The last column is the scenario exercised in the test case below.
- Test case
The test environment is a Flink 1.9.0 Standalone Cluster with one JobManager and one TaskManager, retaining only one checkpoint by default. Failures are simulated by killing the JobManager and TaskManager processes with kill -9.
DataStream<KafkaEvent> input = env.addSource(
new FlinkKafkaConsumer<>("foo", new KafkaEventSchema(), properties)
.assignTimestampsAndWatermarks(new CustomWatermarkExtractor())).name("Example Source")
.map(new MapFunction<KafkaEvent, KafkaEvent>() {
@Override
public KafkaEvent map(KafkaEvent value) throws Exception {
value.setFrequency(value.getFrequency() + 1);
return value;
}
});
input.addSink(
new FlinkKafkaProducer<>(
"bar",
new KafkaSerializationSchemaImpl(),
properties,
FlinkKafkaProducer.Semantic.EXACTLY_ONCE)).name("Example Sink");
1. Set a breakpoint on the first line of the FlinkKafkaProducer#commit method; when the program reaches this breakpoint, kill the JobManager and TaskManager processes with kill -9 to simulate the sink failing in its notifyCheckpointComplete method.
2. Monitor 1: bin/kafka-console-consumer.sh --topic bar --bootstrap-server 10.1.236.66:9092 checks whether the producer has flushed data. Monitor 2: bin/kafka-console-consumer.sh --topic bar --bootstrap-server 10.1.236.66:9092 --isolation-level read_committed checks whether the producer's transaction has been committed successfully. Monitor 3: bin/kafka-console-consumer.sh --topic __consumer_offsets --bootstrap-server 10.1.236.66:9092 --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --consumer.config /tmp/Consumer.properties checks whether the consumer's offsets have been committed to Kafka.
3. Send one record, a,5,1572845161023. When the breakpoint is hit, the consumer's checkpoint has already been taken, but the offset has not yet been committed to Kafka; that is, the checkpoint considers the offset successfully processed while Kafka does not. Monitor 1 shows data, monitors 2 and 3 show nothing. Kill the JobManager and TaskManager processes with kill -9.
4. Restart the cluster and resubmit the job without specifying a checkpoint path. Monitors 1, 2 and 3 all show data; monitor 2 receives the record only once, i.e. exactly-once. Monitor 3 now shows offset=37 for partition0, offset=43 for partition1 and offset=39 for partition2.
5. Repeat steps 1-3 with one record b,6,1572845161023, but in step 4 start the job with -s pointing to the checkpoint to restore from. After startup, monitors 1 and 2 show no data, while monitor 3 shows offset=37 for partition0, offset=43 for partition1 and offset=40 for partition2. The task log contains FlinkKafkaConsumerBase - Consumer subtask 0 restored state: {KafkaTopicPartition{topic='foo', partition=0}=36, KafkaTopicPartition{topic='foo', partition=1}=42, KafkaTopicPartition{topic='foo', partition=2}=39}., which means the checkpoint considers offset=39 of partition2 already consumed, so after recovery the offset committed to Kafka is 40. As a result, the record at offset=39 of partition2 is lost.
- Cause analysis
■ Source first, then sink
One thing to note: the two tasks in this scenario actually run in parallel, so there is no strict ordering between them; the ordering discussed here is only one possible interleaving.

| Recovery method | Failure after the source commit succeeds, before the sink runs | Failure before the source commit succeeds |
| Restore from checkpoint | data loss | data loss |
| Restore from __consumer_offsets | data loss | no data loss |
- Test case
1. Add a keyBy operator to the job used above so that two tasks are created; once monitor 3 receives data, the consumer's notifyCheckpointComplete method has finished. Set a breakpoint on the first line of FlinkKafkaProducer#commit; when the program reaches the breakpoint and monitor 3 has received data, kill the JobManager and TaskManager processes with kill -9 to simulate the sink failing in notifyCheckpointComplete.
2. Now restart the job. The checkpoint and the offsets in Kafka are already consistent, so restoring from the checkpoint or from Kafka gives the same result: the source considers the data successfully consumed and will not re-read the previous offsets, so both cases lead to data loss.
If the process dies before the source's notifyCheckpointComplete runs, none of the operators have executed notifyCheckpointComplete, but the source's checkpoint has already been taken; the offsets just have not been committed to Kafka yet. In that case only restoring from __consumer_offsets guarantees no data loss.
Summary: this section used a rather extreme test scenario in the hope of giving the reader a deeper understanding of Flink's Exactly-Once semantics. After a job crashes, you need to determine what caused the failure and in which phase it happened in order to recover the job in an appropriate way. In a real production environment there are retries and other mechanisms to ensure high availability, and it is also advisable to retain multiple checkpoints so that the business can be restored to correct data.
About the author: Wu Peng is a senior engineer at AsiaInfo and an Apache Flink Contributor. He previously worked at ZTE, IBM and Huawei, and currently leads the development of AsiaInfo's real-time stream processing engine.