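I. Main functions of PartitionStateMachine

In a Kafka cluster, the PartitionStateMachine module is responsible for the state of topic partitions. It registers different listeners on the zookeeper paths /brokers/topics and /admin/delete_topics to watch for topic creation and deletion events, and those events trigger the state transitions of the topic's partitions.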

II. Partition state transitions

The partitionState variable inside PartitionStateMachine holds the state of every partition of every topic, as shown below:

class PartitionStateMachine(controller: KafkaController) extends Logging {
   ......
   private val partitionState: mutable.Map[TopicAndPartition, PartitionState] = mutable.Map.empty
   ......
}

A partition is always in one of four states: NonExistentPartition, NewPartition, OnlinePartition or OfflinePartition. Their meanings are:

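    NonExistentPartition: the partition has never been created, or was created and then deleted.

    NewPartition: the partition has just been created and already has an AR (assigned replicas), but its leader and ISR have not been created yet.

    OnlinePartition: the partition's leader has been elected and the corresponding ISR has been generated.

    OfflinePartition: the partition's leader has gone offline for some reason, so the partition is temporarily unavailable.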

The rules for partition state transitions are as follows:

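| Target state | Previous state(s) | Transition scenario |
| --- | --- | --- |
| NewPartition | NonExistentPartition | A user creates a topic and the topic information is written to zookeeper. KafkaController observes the data change under /brokers/topics and loads the new topic's information, including the number of partitions and the AR lists. |
| OnlinePartition | NewPartition, OnlinePartition, OfflinePartition | 1) A leader is elected and an ISR list is generated for a newly created partition. 2) Leader election is run again for the partition and a new ISR list is generated. |
| OfflinePartition | NewPartition, OnlinePartition, OfflinePartition | No broker server in the AR is online. |
| NonExistentPartition | OfflinePartition | The partition is deleted. |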
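
The transition rules above can be written down as a small lookup from target state to permitted previous states. The following is only an illustrative sketch, not the Kafka source: the object name `PartitionStateSketch` and the helper `isValidTransition` are hypothetical, and the state names are simply copied from the rules.

```scala
// Illustrative model of the rules above; not taken from the Kafka code base.
object PartitionStateSketch {
  sealed trait PartitionState
  case object NonExistentPartition extends PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // Target state -> states from which it may be entered.
  val validPreviousStates: Map[PartitionState, List[PartitionState]] = Map(
    NewPartition -> List(NonExistentPartition),
    OnlinePartition -> List(NewPartition, OnlinePartition, OfflinePartition),
    OfflinePartition -> List(NewPartition, OnlinePartition, OfflinePartition),
    NonExistentPartition -> List(OfflinePartition)
  )

  // Hypothetical helper: true when moving `from` -> `to` is permitted.
  def isValidTransition(from: PartitionState, to: PartitionState): Boolean =
    validPreviousStates.getOrElse(to, Nil).contains(from)
}
```

With this model, `isValidTransition(NonExistentPartition, OnlinePartition)` is false, because a partition has to pass through NewPartition before it can come online.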

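III. Startup of the PartitionStateMachine module

During KafkaController election, the broker server that is elected leader enters the function onControllerFailover to perform its initialization (see the blog post "KafkaController的初始化流程源码解析" in bao2901203013's column). PartitionStateMachine is started as part of that initialization. Its startup initializes the state of every partition. A partition that already has a leader replica is initialized to OnlinePartition or OfflinePartition, depending on whether that leader is online; a partition that has not been assigned a leader replica is initialized to NewPartition. Startup then tries to move every partition in the OfflinePartition or NewPartition state to OnlinePartition, and finally synchronizes the partition state to the remaining broker servers through ControllerChannelManager.

The detailed startup flow of PartitionStateMachine is as follows: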

def startup() {
    // Initialize the partition states
    initializePartitionState()
    // Set the started flag
    hasStarted.set(true)
    // Trigger the transition to the OnlinePartition state
    triggerOnlinePartitionStateChange()
    info("Started partition state machine with initial state -> " + partitionState.toString())
  }

The function initializePartitionState() initializes each partition to one of three states: NewPartition, OnlinePartition or OfflinePartition. Its implementation is as follows:

private def initializePartitionState() {
    for((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) {
      controllerContext.partitionLeadershipInfo.get(topicPartition) match {
        case Some(currentLeaderIsrAndEpoch) =>
          // The partition has already been assigned a leader and an ISR
          controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
            case true => // The leader is online, so the state is OnlinePartition
              partitionState.put(topicPartition, OnlinePartition)
            case false => // The leader is not online, so the state is OfflinePartition
              partitionState.put(topicPartition, OfflinePartition)
          }
        // No leader or ISR has been assigned yet
        case None =>
          partitionState.put(topicPartition, NewPartition)
      }
    }
  }
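
The classification performed by initializePartitionState() depends on only two inputs: whether a leader is recorded for the partition, and whether that leader is among the live brokers. A minimal sketch of that decision as a pure function follows. The object `InitialStateSketch` and the function `initialState` are hypothetical names introduced for illustration; the controller context is reduced to an `Option[Int]` and a `Set[Int]`.

```scala
// Illustrative sketch of the initial-state decision; not the Kafka source.
object InitialStateSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // `leader` is the recorded leader broker id, if any (assumed simplification).
  def initialState(leader: Option[Int], liveBrokerIds: Set[Int]): PartitionState =
    leader match {
      case Some(id) if liveBrokerIds.contains(id) => OnlinePartition  // leader is alive
      case Some(_)                                => OfflinePartition // leader is down
      case None                                   => NewPartition     // no leader yet
    }
}
```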

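Once the partitions have been classified as NewPartition, OnlinePartition or OfflinePartition, the state machine tries to move the NewPartition and OfflinePartition ones to OnlinePartition. The implementation is as follows: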

def triggerOnlinePartitionStateChange() {
    try {
      brokerRequestBatch.newBatch()
      // Skip topics that are queued up for deletion
      for((topicAndPartition, partitionState) <- partitionState
          if(!controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic))) {
        /* Pick the partitions in the OfflinePartition or NewPartition state and try to move them to OnlinePartition */
        if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
          handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                            (new CallbackBuilder).build)
      }
      // Synchronize the partition information to the other broker servers
      brokerRequestBatch.sendRequestsToBrokers(controller.epoch, controllerContext.correlationId.getAndIncrement)
    } catch {
       ......
    }
  }
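
The selection step inside triggerOnlinePartitionStateChange() can be isolated as a filter over the state map. The sketch below assumes a simplified model: `OnlineCandidatesSketch`, `onlineCandidates` and the plain `Set[String]` of topics queued for deletion are hypothetical stand-ins for the controller's internal structures.

```scala
// Illustrative sketch of the candidate filter; not the Kafka source.
object OnlineCandidatesSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  final case class TopicAndPartition(topic: String, partition: Int)

  // Returns the partitions that should be moved to OnlinePartition:
  // those in NewPartition or OfflinePartition whose topic is not being deleted.
  def onlineCandidates(
      states: Map[TopicAndPartition, PartitionState],
      topicsQueuedForDeletion: Set[String]): Set[TopicAndPartition] =
    states.collect {
      case (tp, state)
          if !topicsQueuedForDeletion.contains(tp.topic) &&
            (state == OfflinePartition || state == NewPartition) => tp
    }.toSet
}
```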

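IV. Partition state transition scenarios

The handleStateChange function inside PartitionStateMachine implements the actual state transition logic. Its parameters are as follows: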

private def handleStateChange(topic: String, 
                              partition: Int, 
                              targetState: PartitionState,
                              leaderSelector: PartitionLeaderSelector,
                              callbacks: Callbacks) {
......
}

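Here topic is the topic that the partition belongs to, partition is the partition index, targetState is the target state, and leaderSelector is the leader selector used when a transition needs to elect a leader.

Several transition scenarios are described below.

1. NonExistentPartition -> NewPartition

The transition code is as follows: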

// The target state is NewPartition
case NewPartition =>
          // Make sure the previous state is NonExistentPartition
          assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
          /* Read the topic's AR list from the zookeeper path /brokers/topics/<topic> and store it in KafkaController memory */
          assignReplicasToPartitions(topic, partition)
          // Switch the state to NewPartition
          partitionState.put(topicAndPartition, NewPartition)
          val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).mkString(",")

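As shown, the move from NonExistentPartition to NewPartition is fairly simple: the AR list of each partition of the topic is read from zookeeper and cached in KafkaController memory, and the partition state is then set to NewPartition.

2. NewPartition, OnlinePartition, OfflinePartition -> OnlinePartition

The transition code is as follows: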

// The target state is OnlinePartition
case OnlinePartition =>
          // Make sure the previous state is NewPartition, OnlinePartition or OfflinePartition
          assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
          partitionState(topicAndPartition) match {
            case NewPartition =>
              // Initialize the leader and ISR from the partition's AR list
              initializeLeaderAndIsrForPartition(topicAndPartition)

            case OfflinePartition =>
              // Use the leader replica selector to elect the leader and ISR
              electLeaderForPartition(topic, partition, leaderSelector)

            case OnlinePartition =>
              // Use the leader replica selector to re-elect the leader and ISR
              electLeaderForPartition(topic, partition, leaderSelector)
            case _ => // This branch should never be reached
          }
          // Set the in-memory state to OnlinePartition
          partitionState.put(topicAndPartition, OnlinePartition)
          val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader

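As shown, moving from NewPartition to OnlinePartition does not use the leader replica selector to elect a leader and ISR. Instead, the first live broker in the AR list becomes the leader, and the live brokers in the AR become the ISR.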
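
The rule for a new partition (first live broker in the AR becomes leader, live AR brokers become the ISR) can be sketched as a short function. This is an illustration under stated assumptions rather than the body of initializeLeaderAndIsrForPartition: the names `InitialLeaderSketch` and `initialLeaderAndIsr` are hypothetical, and returning `None` when no assigned replica is alive is a choice made for this sketch only.

```scala
// Illustrative sketch of the initial leader/ISR choice; not the Kafka source.
object InitialLeaderSketch {
  // Returns (leader, isr), or None when no assigned replica is alive
  // (the None case is an assumption of this sketch).
  def initialLeaderAndIsr(
      assignedReplicas: Seq[Int],
      liveBrokerIds: Set[Int]): Option[(Int, Seq[Int])] = {
    // Keep AR order so that the first live replica is the preferred one.
    val liveAssigned = assignedReplicas.filter(liveBrokerIds.contains)
    liveAssigned.headOption.map(leader => (leader, liveAssigned))
  }
}
```

For an AR of 3, 1, 2 with only brokers 1 and 2 alive, the sketch picks broker 1 as leader and uses 1, 2 as the ISR.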

In contrast, the transitions from OfflinePartition and from OnlinePartition to OnlinePartition do use the leader selector to run an election. The process is as follows:

def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {

    // Build the TopicAndPartition
    val topicAndPartition = TopicAndPartition(topic, partition)
    try {
      var zookeeperPathUpdateSucceeded: Boolean = false
      var newLeaderAndIsr: LeaderAndIsr = null
      var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
      while(!zookeeperPathUpdateSucceeded) {
        // Read the LeaderIsrAndControllerEpoch from zookeeper
        val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition)
        val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
        val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
        if (controllerEpoch > controller.epoch) {
          /*
          * Only the KafkaController that is currently the leader may trigger partition leader election.
          * If the controllerEpoch recorded in zookeeper is greater than the current epoch, this KafkaController is stale */
          throw new StateChangeFailedException(failMsg)
        }
        // Elect the new LeaderAndIsr from the TopicAndPartition and the current LeaderAndIsr
        val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr)
        // Persist it to zookeeper
        val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topic, partition,
          leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
        newLeaderAndIsr = leaderAndIsr
        newLeaderAndIsr.zkVersion = newVersion
        zookeeperPathUpdateSucceeded = updateSucceeded
        replicasForThisPartition = replicas
      }
      val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
      // Update KafkaController memory
      controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)

      val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
      // Queue a LeaderAndIsr request so the new leader and ISR are sent to the brokers hosting this partition's replicas
      brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
        newLeaderIsrAndControllerEpoch, replicas)
    } catch {
      ......
    }
  }
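
The while loop in electLeaderForPartition is an optimistic read, select, conditional-write cycle: the write to zookeeper succeeds only if the version read at the start of the iteration is still current, otherwise the loop reads again and retries. The sketch below reproduces that pattern against an in-memory stand-in. Everything in it is an assumption made for illustration: `VersionedStore` is not a zookeeper client, `electLeader` is a hypothetical name, and "first live ISR member" is a placeholder selection rule, not one of Kafka's selectors.

```scala
// Illustrative sketch of the conditional-update retry loop; not the Kafka source.
object ConditionalUpdateSketch {
  final case class LeaderAndIsr(leader: Int, isr: List[Int], version: Int)

  // In-memory stand-in for the versioned zookeeper node (assumption).
  final class VersionedStore(initial: LeaderAndIsr) {
    private var current: LeaderAndIsr = initial

    def read(): LeaderAndIsr = this.synchronized { current }

    // Writes only when the caller saw the latest version; returns the new version.
    def conditionalUpdate(leader: Int, isr: List[Int], expectedVersion: Int): Option[Int] =
      this.synchronized {
        if (current.version == expectedVersion) {
          current = LeaderAndIsr(leader, isr, expectedVersion + 1)
          Some(current.version)
        } else None
      }
  }

  // Retries until the conditional write succeeds.
  def electLeader(store: VersionedStore, liveBrokerIds: Set[Int]): LeaderAndIsr = {
    var result: Option[LeaderAndIsr] = None
    while (result.isEmpty) {
      val seen = store.read()
      val liveIsr = seen.isr.filter(liveBrokerIds.contains)
      // Placeholder selection rule: first live member of the current ISR.
      val leader = liveIsr.headOption.getOrElse(
        throw new IllegalStateException("no live replica in ISR"))
      result = store
        .conditionalUpdate(leader, liveIsr, seen.version)
        .map(newVersion => LeaderAndIsr(leader, liveIsr, newVersion))
    }
    result.get
  }
}
```

Carrying the version through the write is what lets a concurrent change be detected without holding a lock across the read and the write.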

3. NewPartition, OnlinePartition, OfflinePartition -> OfflinePartition

The transition code is as follows:

// The target state is OfflinePartition
case OfflinePartition =>
          // Make sure the previous state is NewPartition, OnlinePartition or OfflinePartition
          assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
          // Switch the state to OfflinePartition
          partitionState.put(topicAndPartition, OfflinePartition)

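This process is very simple: it only changes the in-memory state to OfflinePartition.

V. Summary

As the module that manages partition state in a Kafka cluster, PartitionStateMachine performs partition state transitions by watching for changes to zookeeper paths. It covers the scenarios in which partitions change, such as topic creation, partition reassignment and topic deletion, and is a very important part of metadata management in a Kafka cluster.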