What is idempotence?

Making the same request against a resource once or many times leaves the resource itself in the same state, although the responses returned may differ.

Idempotence is used pervasively in distributed systems; together with mutual exclusion, it is one of the two major problem areas a distributed project must take seriously.

Idempotence in Kafka

When a Producer sends messages, uncontrollable factors such as network delays can force it to resend, and its retry mechanism kicks in. With idempotence introduced in Kafka, the Broker is guaranteed not to accept the same message twice, preserving exactly-once semantics.

[Figure: normal flow, the Producer sends a message and the Broker returns an ack after appending it]

The figure above shows the normal case: the Producer sends a message to the Broker, and the Broker returns an ack after appending it. If the network between Producer and Broker fails, or a full GC on the Broker delays or drops the ack, the Producer resends the message and a duplicate is written.

[Figure: a delayed or lost ack causes the Producer to resend, duplicating the message]

Before version 0.11, Kafka had no way to handle resent messages, so it could not fully guarantee exactly-once semantics, only at-least-once; a downstream consumer that needed exactly-once had to add its own deduplication. Since idempotence arrived in 0.11, simply setting the Producer's enable.idempotence option to true is enough to obtain exactly-once delivery.
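
For example, enabling it is a single producer setting. A minimal sketch (the broker address and topic name below are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The switch discussed above; it requires acks=all and retries > 0.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A retry triggered by a lost ack is now deduplicated by the Broker.
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}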

How Kafka implements idempotence:

  • PID: a globally unique identifier for the producer session, assigned when the Producer is initialized
  • Sequence number: every batch the Producer sends carries a sequence number that increases monotonically from 0 within the session (in practice, records are first placed in a local cache and then sent by the asynchronous Sender thread; see the later section on how the Producer sends messages). The Broker uses this value to decide whether a batch must be discarded

[Figure: the Broker uses the PID and batch sequence numbers to discard duplicate batches]

Kafka's idempotence guarantees no duplicates and no loss only at the finest granularity: a single Producer session writing to a single TopicPartition. If the Producer restarts (so the PID changes), or a write spans topics or partitions, plain idempotence breaks down; the higher-level transaction mechanism is needed, coordinated by a dedicated TransactionCoordinator component.
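
Writes that span partitions or producer restarts therefore go through the transactions API. A minimal hedged sketch of its use (topic names and the transactional.id are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A stable transactional.id lets the TransactionCoordinator fence zombie
        // producers and recover producer state across restarts.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-id");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions(); // obtains the PID/epoch from the TransactionCoordinator
            producer.beginTransaction();
            try {
                // These writes to different topics commit or abort together.
                producer.send(new ProducerRecord<>("topic-a", "k", "v1"));
                producer.send(new ProducerRecord<>("topic-b", "k", "v2"));
                producer.commitTransaction();
            } catch (RuntimeException e) {
                // Simplified: fatal errors such as ProducerFencedException would
                // require closing the producer instead of aborting.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}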

Producer-side processing logic

1. Component overview

  • KafkaProducer: the Producer instance
  • Sender: the thread inside KafkaProducer that sends messages to the Broker
  • RecordAccumulator: the accumulator (cache) of ProducerBatch message batches; when the cache fills up, the Sender thread is woken to send
  • TransactionManager: when idempotence/transactions are enabled, this component inside the Producer records the PID, each TopicPartition's sequence numbers, and transaction state

2. Simplified flow analysis

When a client calls KafkaProducer.send(), the message is first stored in the RecordAccumulator until the cache reaches the configured threshold, at which point the Sender thread is woken up to send. Since a ProducerBatch being sent for the first time has not yet been assigned a PID or sequence numbers, let's start from the Sender thread's run() method:

void run(long now) {
        if (transactionManager != null) {
            try {
                if (transactionManager.shouldResetProducerStateAfterResolvingSequences())
                    // Check if the previous run expired batches which requires a reset of the producer state.
                    transactionManager.resetProducerId();

                if (!transactionManager.isTransactional()) {
                    // this is an idempotent producer, so make sure we have a producer id
                    maybeWaitForProducerId();
                } else if (transactionManager.hasUnresolvedSequences() && !transactionManager.hasFatalError()) {
                    transactionManager.transitionToFatalError(new KafkaException("The client hasn't received acknowledgment for " +
                            "some previously sent messages and can no longer retry them. It isn't safe to continue."));
                } else if (transactionManager.hasInFlightTransactionalRequest() || maybeSendTransactionalRequest(now)) {
                    // as long as there are outstanding transactional requests, we simply wait for them to return
                    client.poll(retryBackoffMs, now);
                    return;
                }

                // do not continue sending if the transaction manager is in a failed state or if there
                // is no producer id (for the idempotent case).
                if (transactionManager.hasFatalError() || !transactionManager.hasProducerId()) {
                    RuntimeException lastError = transactionManager.lastError();
                    if (lastError != null)
                        maybeAbortBatches(lastError);
                    client.poll(retryBackoffMs, now);
                    return;
                } else if (transactionManager.hasAbortableError()) {
                    accumulator.abortUndrainedBatches(transactionManager.lastError());
                }
            } catch (AuthenticationException e) {
                // This is already logged as error, but propagated here to perform any clean ups.
                log.trace("Authentication exception while processing transactional request: {}", e);
                transactionManager.authenticationFailed(e);
            }
        }

        long pollTimeout = sendProducerData(now);
        client.poll(pollTimeout, now);
    }

Let's first look at the maybeWaitForProducerId() method, which checks whether the producer has been assigned a PID and requests one if not:

private void maybeWaitForProducerId() {
        while (!transactionManager.hasProducerId() && !transactionManager.hasError()) {
            try {
                Node node = awaitLeastLoadedNodeReady(requestTimeoutMs);
                if (node != null) {
                    ClientResponse response = sendAndAwaitInitProducerIdRequest(node);
                    InitProducerIdResponse initProducerIdResponse = (InitProducerIdResponse) response.responseBody();
                    Errors error = initProducerIdResponse.error();
                    if (error == Errors.NONE) {
                        ProducerIdAndEpoch producerIdAndEpoch = new ProducerIdAndEpoch(
                                initProducerIdResponse.producerId(), initProducerIdResponse.epoch());
                        transactionManager.setProducerIdAndEpoch(producerIdAndEpoch);
                        return;
                    } else if (error.exception() instanceof RetriableException) {
                        log.debug("Retriable error from InitProducerId response", error.message());
                    } else {
                        transactionManager.transitionToFatalError(error.exception());
                        break;
                    }
                } else {
                    log.debug("Could not find an available broker to send InitProducerIdRequest to. " +
                            "We will back off and try again.");
                }
            } catch (UnsupportedVersionException e) {
                transactionManager.transitionToFatalError(e);
                break;
            } catch (IOException e) {
                log.debug("Broker {} disconnected while awaiting InitProducerId response", e);
            }
            log.trace("Retry InitProducerIdRequest in {}ms.", retryBackoffMs);
            time.sleep(retryBackoffMs);
            metadata.requestUpdate();
        }
    }

When choosing a node to send to, the client first issues a request to find the currently least-loaded node, then sends an InitProducerId request to that node. From the response it obtains the PID and the epoch value for this session (the epoch is transaction-related). On success, the PID and epoch are saved in the TransactionManager instance; on failure, the request is retried according to the configured backoff.

Next, the sendProducerData() method called in Sender.run() actually drains the cached messages from the RecordAccumulator and eventually wraps them into a ProduceRequest. Below is an excerpt from sendProducerData():

// create produce requests
Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster,
        result.readyNodes, this.maxRequestSize, now);
if (guaranteeMessageOrder) {
    // Mute all the partitions drained
    for (List<ProducerBatch> batchList : batches.values()) {
        for (ProducerBatch batch : batchList)
            this.accumulator.mutePartition(batch.topicPartition);
    }
}

The accumulator.drain() method it calls contains most of the idempotence-related logic:

public Map<Integer, List<ProducerBatch>> drain(Cluster cluster,
                                                   Set<Node> nodes,
                                                   int maxSize,
                                                   long now) {
        if (nodes.isEmpty())
            return Collections.emptyMap();

        Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
        for (Node node : nodes) {
            int size = 0;
            List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
            List<ProducerBatch> ready = new ArrayList<>();
            /* to make starvation less likely this loop doesn't start at 0 */
            int start = drainIndex = drainIndex % parts.size();
            do {
                PartitionInfo part = parts.get(drainIndex);
                TopicPartition tp = new TopicPartition(part.topic(), part.partition());
                // Only proceed if the partition has no in-flight batches.
                if (!isMuted(tp, now)) {
                    Deque<ProducerBatch> deque = getDeque(tp);
                    if (deque != null) {
                        synchronized (deque) {
                            ProducerBatch first = deque.peekFirst();
                            if (first != null) {
                                boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;
                                // Only drain the batch if it is not during backoff period.
                                if (!backoff) {
                                    if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {
                                        // there is a rare case that a single batch size is larger than the request size due
                                        // to compression; in this case we will still eventually send this batch in a single
                                        // request
                                        break;
                                    } else {
                                        ProducerIdAndEpoch producerIdAndEpoch = null;
                                        boolean isTransactional = false;
                                        if (transactionManager != null) {
                                            if (!transactionManager.isSendToPartitionAllowed(tp))
                                                break;

                                            producerIdAndEpoch = transactionManager.producerIdAndEpoch();
                                            if (!producerIdAndEpoch.isValid())
                                                // we cannot send the batch until we have refreshed the producer id
                                                break;

                                            isTransactional = transactionManager.isTransactional();

                                            if (!first.hasSequence() && transactionManager.hasUnresolvedSequence(first.topicPartition))
                                                // Don't drain any new batches while the state of previous sequence numbers
                                                // is unknown. The previous batches would be unknown if they were aborted
                                                // on the client after being sent to the broker at least once.
                                                break;

                                            int firstInFlightSequence = transactionManager.firstInFlightSequence(first.topicPartition);
                                            if (firstInFlightSequence != RecordBatch.NO_SEQUENCE && first.hasSequence()
                                                    && first.baseSequence() != firstInFlightSequence)
                                                // If the queued batch already has an assigned sequence, then it is being
                                                // retried. In this case, we wait until the next immediate batch is ready
                                                // and drain that. We only move on when the next in line batch is complete (either successfully
                                                // or due to a fatal broker error). This effectively reduces our
                                                // in flight request count to 1.
                                                break;
                                        }

                                        ProducerBatch batch = deque.pollFirst();
                                        if (producerIdAndEpoch != null && !batch.hasSequence()) {
                                            // If the batch already has an assigned sequence, then we should not change the producer id and
                                            // sequence number, since this may introduce duplicates. In particular,
                                            // the previous attempt may actually have been accepted, and if we change
                                            // the producer id and sequence here, this attempt will also be accepted,
                                            // causing a duplicate.
                                            //
                                            // Additionally, we update the next sequence number bound for the partition,
                                            // and also have the transaction manager track the batch so as to ensure
                                            // that sequence ordering is maintained even if we receive out of order
                                            // responses.
                                            batch.setProducerState(producerIdAndEpoch, transactionManager.sequenceNumber(batch.topicPartition), isTransactional);
                                            transactionManager.incrementSequenceNumber(batch.topicPartition, batch.recordCount);
                                            log.debug("Assigned producerId {} and producerEpoch {} to batch with base sequence " +
                                                            "{} being sent to partition {}", producerIdAndEpoch.producerId,
                                                    producerIdAndEpoch.epoch, batch.baseSequence(), tp);

                                            transactionManager.addInFlightBatch(batch);
                                        }
                                        batch.close();
                                        size += batch.records().sizeInBytes();
                                        ready.add(batch);
                                        batch.drained(now);
                                    }
                                }
                            }
                        }
                    }
                }
                this.drainIndex = (this.drainIndex + 1) % parts.size();
            } while (start != drainIndex);
            batches.put(node.id(), ready);
        }
        return batches;
    }
From the drain() code above, a batch is sent only when all of the following hold:

  • The partition currently being polled has no in-flight batches (it is not muted)
  • The batch is not within its retry backoff interval
  • The PID and epoch are valid
  • No earlier batch for this partition is unresolved: TransactionManager.hasUnresolvedSequence() checks whether the fate of previously assigned sequence numbers is still unknown (this keeps sends from being reordered)
  • No data for this TopicPartition is in flight with a different sequence number: if an in-flight batch exists whose sequence differs from this batch's, this batch is a retry and must wait until the in-flight batch completes

Only when all these checks pass can the current batch be sent. If the batch has no sequence number yet, this is its first send: the PID, epoch, and sequence number are written into the ProducerBatch, TransactionManager.incrementSequenceNumber() advances the sequence counter maintained for that partition, and finally the batch is marked in-flight, i.e. in the pending-send state.
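
As a toy illustration of this bookkeeping, here is a minimal sketch of a per-partition sequence counter (the class and field names are invented; Kafka's real logic lives in TransactionManager):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;

class SequenceTrackerSketch {
    private final Map<TopicPartition, Integer> nextSequence = new HashMap<>();

    // Sequence number stamped on the next batch for this partition; starts at 0.
    int sequenceNumber(TopicPartition tp) {
        return nextSequence.getOrDefault(tp, 0);
    }

    // After a batch of recordCount records is drained, advance the counter,
    // wrapping past Integer.MAX_VALUE back to 0 as described later in this article.
    void incrementSequenceNumber(TopicPartition tp, int recordCount) {
        long next = (sequenceNumber(tp) + (long) recordCount) % (Integer.MAX_VALUE + 1L);
        nextSequence.put(tp, (int) next);
    }
}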

Broker-side processing logic (Scala)

  • BatchMetadata: metadata for a batch, including the sequence number of the last message in the batch, its offset, the number of records, and so on
  • ProducerIdEntry: maintains the most recent BatchMetadata for a single PID as a queue, lower sequence numbers at the front, with a fixed capacity of 5 (see the sketch below)
  • ProducerStateManager: manages the ProducerIdEntry of every TopicPartition
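
To make ProducerIdEntry concrete, here is a hedged Java sketch of the idea (the real implementation is Scala code inside the broker; the names below are simplified):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

class ProducerIdEntrySketch {
    record BatchMeta(int firstSeq, int lastSeq, long baseOffset) {}

    private static final int MAX_CACHED = 5; // fixed capacity, as noted above
    private final Deque<BatchMeta> batchMetadata = new ArrayDeque<>();

    // Append metadata for a newly written batch, evicting the oldest entry
    // once the queue already holds five.
    void addBatch(BatchMeta meta) {
        if (batchMetadata.size() == MAX_CACHED)
            batchMetadata.pollFirst();
        batchMetadata.addLast(meta);
    }

    // A resent batch is recognized by an exact [firstSeq, lastSeq] match,
    // mirroring batchWithSequenceRange() shown later.
    Optional<BatchMeta> duplicateOf(int firstSeq, int lastSeq) {
        return batchMetadata.stream()
                .filter(m -> m.firstSeq() == firstSeq && m.lastSeq() == lastSeq)
                .findFirst();
    }
}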

After the Producer's ProduceRequest goes out, the Broker handles it in KafkaApis.handleProduceRequest(); by the time the data is finally written to the log, we reach Log.append(). Let's see how it validates the PID and sequence numbers:

private def analyzeAndValidateProducerState(records: MemoryRecords, isFromClient: Boolean):
  (mutable.Map[Long, ProducerAppendInfo], List[CompletedTxn], Option[BatchMetadata]) = {
    val updatedProducers = mutable.Map.empty[Long, ProducerAppendInfo]
    val completedTxns = ListBuffer.empty[CompletedTxn]
    for (batch <- records.batches.asScala if batch.hasProducerId) {
      val maybeLastEntry = producerStateManager.lastEntry(batch.producerId)
 
      if (isFromClient)
        maybeLastEntry.flatMap(_.duplicateOf(batch)).foreach { duplicate =>
          return (updatedProducers, completedTxns.toList, Some(duplicate))
        }
 
      val maybeCompletedTxn = updateProducers(batch, updatedProducers, isFromClient = isFromClient)
      maybeCompletedTxn.foreach(completedTxns += _)
    }
    (updatedProducers, completedTxns.toList, None)
}

The loop first filters for batches that carry a PID, then looks in the corresponding BatchMetadata maintained by ProducerStateManager for a duplicate (i.e. whether this batch is a resend). The core of this check is the duplicateOf() method:

def duplicateOf(batch: RecordBatch): Option[BatchMetadata] = {
    if (batch.producerEpoch() != producerEpoch)
       None
    else
      batchWithSequenceRange(batch.baseSequence(), batch.lastSequence())
  }
 
  def batchWithSequenceRange(firstSeq: Int, lastSeq: Int): Option[BatchMetadata] = {
    val duplicate = batchMetadata.filter { case(metadata) =>
      firstSeq == metadata.firstSeq && lastSeq == metadata.lastSeq
    }
    duplicate.headOption
  }

If the sequence numbers of the first and last messages of the batch exactly match a cached entry, the batch is a resend. Only when the batch is not a duplicate does the Broker go on to call updateProducers() to update the BatchMetadata.

The sequence-number validation logic lives in ProducerAppendInfo.checkSequence():

private def checkSequence(producerEpoch: Short, firstSeq: Int, lastSeq: Int): Unit = {
    if (producerEpoch != currentEntry.producerEpoch) {
      if (firstSeq != 0) {
        if (currentEntry.producerEpoch != RecordBatch.NO_PRODUCER_EPOCH) {
          throw new OutOfOrderSequenceException(s"Invalid sequence number for new epoch: $producerEpoch " +
            s"(request epoch), $firstSeq (seq. number)")
        } else {
          throw new UnknownProducerIdException(s"Found no record of producerId=$producerId on the broker. It is possible " +
            s"that the last message with the producerId=$producerId has been removed due to hitting the retention limit.")
        }
      }
    } else if (currentEntry.lastSeq == RecordBatch.NO_SEQUENCE && firstSeq != 0) {
      throw new OutOfOrderSequenceException(s"Out of order sequence number for producerId $producerId: found $firstSeq " +
        s"(incoming seq. number), but expected 0")
    } else if (isDuplicate(firstSeq, lastSeq)) {
      throw new DuplicateSequenceException(s"Duplicate sequence number for producerId $producerId: (incomingBatch.firstSeq, " +
        s"incomingBatch.lastSeq): ($firstSeq, $lastSeq).")
    } else if (!inSequence(firstSeq, lastSeq)) {
      throw new OutOfOrderSequenceException(s"Out of order sequence number for producerId $producerId: $firstSeq " +
        s"(incoming seq. number), ${currentEntry.lastSeq} (current end sequence number)")
    }
  }
 
  private def isDuplicate(firstSeq: Int, lastSeq: Int): Boolean = {
    ((lastSeq != 0 && currentEntry.firstSeq != Int.MaxValue && lastSeq < currentEntry.firstSeq)
      || currentEntry.batchWithSequenceRange(firstSeq, lastSeq).isDefined)
  }
 
  private def inSequence(firstSeq: Int, lastSeq: Int): Boolean = {
    firstSeq == currentEntry.lastSeq + 1L || (firstSeq == 0 && currentEntry.lastSeq == Int.MaxValue)
}

As the code shows, if the Producer's epoch has changed, the first batch written must have sequence number 0 (the Producer is no longer the same instance). When the latest cached sequence number is -1 (NO_SEQUENCE), the Producer behind this PID has never produced a message, so the incoming batch must likewise start at sequence 0. The only other legal case is a strictly consecutive sequence: the incoming firstSeq must equal the cached lastSeq + 1 (for example, if currentEntry.lastSeq is 41, the next batch must begin at 42), and when the counter reaches Int.MaxValue it wraps around to 0.

Note in particular that because a ProducerIdEntry caches only a fixed 5 BatchMetadata entries, the Producer's max.in.flight.requests.per.connection must not be set above 5 when idempotence is enabled. If it were larger, a batch's metadata could be evicted from the cache; if that batch were then retried, it would never find its matching BatchMetadata, keep failing until the retry limit is reached, and throughput would suffer badly.
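
In configuration terms, a safe idempotent-producer setup therefore looks like this minimal sketch:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class InFlightConfigDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Keep at most 5 unacknowledged requests per connection so that a
        // retried batch can always find its BatchMetadata in the size-5 cache.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");
        System.out.println(props);
    }
}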