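1. Controller election

At any given moment, exactly one broker in a Kafka cluster acts as the controller. That broker is elected from among all brokers in the cluster, and the role can move to a different broker over time.

How the election works: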



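During an election, every broker tries to create the /controller node in ZooKeeper, but only one of them can succeed; that broker becomes the controller. Every broker also registers a watcher on /controller, so when the broker hosting the controller goes down, the remaining brokers are notified and a new election is triggered.

When an election is triggered:

  1. The cluster starts up
  2. The /controller node disappears
  3. The data of the /controller node changes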
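
The race for /controller and the watcher-driven re-election can be sketched without ZooKeeper at all. The block below is a minimal in-memory model, not the ZooKeeper or Kafka API: FakeControllerZNode, FakeBroker and ElectionDemo are hypothetical names introduced only for illustration, and the "node" is just a field guarded by a lock.

```scala
import scala.collection.mutable

// In-memory stand-in for the single /controller znode (assumption: not the real ZooKeeper API).
final class FakeControllerZNode {
  private var owner: Option[Int] = None
  private val watchers = mutable.ListBuffer.empty[() => Unit]

  // Mimics creating an ephemeral node: succeeds only if the node is absent.
  def tryCreate(brokerId: Int): Boolean = synchronized {
    if (owner.isEmpty) { owner = Some(brokerId); true } else false
  }

  def currentOwner: Option[Int] = synchronized(owner)

  // Every broker registers a callback that fires when the node disappears.
  def watchDeletion(callback: () => Unit): Unit = synchronized { watchers += callback }

  // Mimics the owner's session ending: the node vanishes and all watchers fire.
  def ownerDied(): Unit = {
    val toNotify = synchronized { owner = None; watchers.toList }
    toNotify.foreach(callback => callback())
  }
}

final class FakeBroker(val id: Int, znode: FakeControllerZNode) {
  @volatile var alive: Boolean = true

  // A surviving broker reacts to the node's deletion by trying to win the election.
  znode.watchDeletion(() => { if (alive) elect(); () })

  def elect(): Boolean = znode.tryCreate(id)
  def isController: Boolean = znode.currentOwner.contains(id)
}

object ElectionDemo {
  // Returns (first controller, controller after the first one dies).
  def run(): (Int, Int) = {
    val znode = new FakeControllerZNode
    val brokers = Seq(0, 1, 2).map(id => new FakeBroker(id, znode))
    brokers.foreach(_.elect())                 // cluster start: every broker tries
    val first = znode.currentOwner.get         // the broker that tried first wins
    brokers.find(_.id == first).foreach(b => b.alive = false)
    znode.ownerDied()                          // /controller disappears
    val second = znode.currentOwner.get        // a surviving broker takes over
    (first, second)
  }
}
```

In this sketch the brokers try in a fixed order, so broker 0 wins first and broker 1 takes over; in a real cluster the winner is whichever broker's create request ZooKeeper happens to process first.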

The controller's main code lives in KafkaController.scala. Let's start with the important fields of the KafkaController class.

class KafkaController(val config: KafkaConfig, // Kafka configuration
                      zkClient: KafkaZkClient, // ZooKeeper client that wraps all ZooKeeper operations
                      time: Time, // time utility
                      metrics: Metrics, // utility for the metrics service (e.g. creating metrics)
                      initialBrokerInfo: BrokerInfo, // information about this broker
                      initialBrokerEpoch: Long, // broker epoch, used to fence off stale requests (e.g. those meant for an earlier incarnation of this broker)
                      tokenManager: DelegationTokenManager, // utility for managing delegation tokens
                      threadNamePrefix: Option[String] = None // name prefix of the controller event-processing thread
                     ) extends Logging with KafkaMetricsGroup {

  // Cluster metadata holder; stores all metadata of the cluster
  val controllerContext = new ControllerContext
  // Scheduler; currently its only job is to run the periodic leader rebalance
  private[controller] val kafkaScheduler = new KafkaScheduler(1)
  // Controller event manager; owns the event-processing thread
  private[controller] val eventManager = new ControllerEventManager(config.brokerId,
    controllerContext.stats.rateAndTimeMetrics, _ => updateMetrics(), () => maybeResign())
  // Topic deletion manager
  val topicDeletionManager = new TopicDeletionManager(this, eventManager, zkClient)
  private val brokerRequestBatch = new ControllerBrokerRequestBatch(this, stateChangeLogger)
  // Replica state machine; drives replica state transitions
  val replicaStateMachine = new ReplicaStateMachine(config, stateChangeLogger, controllerContext, topicDeletionManager, zkClient, mutable.Map.empty, new ControllerBrokerRequestBatch(this, stateChangeLogger))
  // Partition state machine; drives partition state transitions
  val partitionStateMachine = new PartitionStateMachine(config, stateChangeLogger, controllerContext, zkClient, mutable.Map.empty, new ControllerBrokerRequestBatch(this, stateChangeLogger))
  partitionStateMachine.setTopicDeletionManager(topicDeletionManager)

  // ZooKeeper listener for the /controller node
  private val controllerChangeHandler = new ControllerChangeHandler(this, eventManager)
  // ZooKeeper listener for brokers joining or leaving
  private val brokerChangeHandler = new BrokerChangeHandler(this, eventManager)
  // ZooKeeper listeners for changes to each broker's information, keyed by broker id
  private val brokerModificationsHandlers: mutable.Map[Int, BrokerModificationsHandler] = mutable.Map.empty
  // ZooKeeper listener for topics being added or removed
  private val topicChangeHandler = new TopicChangeHandler(this, eventManager)
  // ZooKeeper listener for topic deletion
  private val topicDeletionHandler = new TopicDeletionHandler(this, eventManager)
  // ZooKeeper listeners for partition changes, keyed by topic
  private val partitionModificationsHandlers: mutable.Map[String, PartitionModificationsHandler] = mutable.Map.empty
  // ZooKeeper listener for partition reassignment
  private val partitionReassignmentHandler = new PartitionReassignmentHandler(this, eventManager)
  // ZooKeeper listener for preferred leader election
  private val preferredReplicaElectionHandler = new PreferredReplicaElectionHandler(this, eventManager)
  // ZooKeeper listener for ISR change notifications
  private val isrChangeNotificationHandler = new IsrChangeNotificationHandler(this, eventManager)
  // ZooKeeper listener for log directory events
  private val logDirEventNotificationHandler = new LogDirEventNotificationHandler(this, eventManager)

  // Id of the broker that currently hosts the controller
  @volatile private var activeControllerId = -1
  // Total number of offline partitions
  @volatile private var offlinePartitionCount = 0
  // Total number of partitions eligible for preferred leader election
  @volatile private var preferredReplicaImbalanceCount = 0
  // Total number of topics
  @volatile private var globalTopicCount = 0
  // Total number of partitions
  @volatile private var globalPartitionCount = 0

...

}

KafkaController is started from the startup method of KafkaServer.

// KafkaServer.scala
kafkaController = new KafkaController(config, zkClient, time, metrics, brokerInfo, brokerEpoch, tokenManager, threadNamePrefix)
kafkaController.startup()

// KafkaController.scala
def startup() = {
    // The state change handler triggers a re-election after the ZooKeeper session expires
    zkClient.registerStateChangeHandler(new StateChangeHandler {
      override val name: String = StateChangeHandlers.ControllerHandler
      override def afterInitializingSession(): Unit = {
        eventManager.put(RegisterBrokerAndReelect)
      }
      override def beforeInitializingSession(): Unit = {
        val expireEvent = new Expire
        eventManager.clearAndPut(expireEvent)

        // Block until processing of the Expire event has started;
        // the new session is not initialized before then
        expireEvent.waitUntilProcessingStarted()
      }
    })
    // Put the Startup event into eventManager
    eventManager.put(Startup)
    // Start the eventManager background thread, which kicks off the election
    eventManager.start()
  }
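
The interplay of clearAndPut and waitUntilProcessingStarted can be illustrated with a queue and a latch. The following is a simplified sketch under stated assumptions, not Kafka's ControllerEventManager: TinyEventManager, DemoEvent, StaleWork, StopWorker, ExpireLike and ExpireDemo are hypothetical names, and the real clearAndPut does more bookkeeping than simply dropping queued events.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}
import scala.collection.mutable

sealed trait DemoEvent
case object StaleWork extends DemoEvent
case object StopWorker extends DemoEvent

// An event whose "processing has started" moment can be awaited by another thread.
final class ExpireLike extends DemoEvent {
  private val started = new CountDownLatch(1)
  def markStarted(): Unit = started.countDown()
  def waitUntilProcessingStarted(): Unit = started.await()
}

final class TinyEventManager {
  private val lock = new Object
  private val queue = new LinkedBlockingQueue[DemoEvent]()
  private val processed = mutable.ListBuffer.empty[String]

  // Single worker thread: events are handled one at a time, in queue order.
  private val worker = new Thread(() => {
    var running = true
    while (running) {
      queue.take() match {
        case StopWorker => running = false
        case e: ExpireLike =>
          e.markStarted()          // releases anyone blocked in waitUntilProcessingStarted
          record("Expire")
        case StaleWork => record("StaleWork")
      }
    }
  })

  private def record(name: String): Unit = processed.synchronized { processed += name; () }

  def put(e: DemoEvent): Unit = lock.synchronized { queue.put(e) }

  // Drop everything that is still queued, then enqueue the given event.
  def clearAndPut(e: DemoEvent): Unit = lock.synchronized { queue.clear(); queue.put(e) }

  def start(): Unit = worker.start()

  def stopAndGetLog(): List[String] = {
    put(StopWorker)
    worker.join()
    processed.synchronized(processed.toList)
  }
}

object ExpireDemo {
  def run(): List[String] = {
    val manager = new TinyEventManager
    manager.put(StaleWork)
    manager.put(StaleWork)
    val expire = new ExpireLike
    manager.clearAndPut(expire)            // the two stale events are discarded
    manager.start()
    expire.waitUntilProcessingStarted()    // returns once the worker has picked up the event
    manager.stopAndGetLog()                // List("Expire")
  }
}
```

The latch is what lets the session handler block until the event thread has actually reached the expiry event, rather than merely until the event has been queued.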

When KafkaController starts, it puts a Startup event into eventManager. The handling logic for that event is in the process method of Startup.

case object Startup extends ControllerEvent {

    def state = ControllerState.ControllerChange

    override def process(): Unit = {
      // Register the ControllerChangeHandler ZooKeeper listener
      zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
      // Elect the controller
      elect()
    }
  }
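
The pattern here is that callers only enqueue events, and a single loop later invokes process() on each one. A minimal sketch of that contract follows. MiniEvent, MiniEventLoop and MiniLoopDemo are hypothetical names, and the loop is drained synchronously for simplicity, whereas the real event manager runs on its own background thread.

```scala
import scala.collection.mutable

// Miniature of the event contract: a state label plus a process() body.
trait MiniEvent {
  def state: String
  def process(): Unit
}

final class MiniEventLoop {
  private val queue = mutable.Queue.empty[MiniEvent]
  val log: mutable.ListBuffer[String] = mutable.ListBuffer.empty[String]

  // Producers (listeners, startup code) only enqueue; they never run the logic themselves.
  def put(event: MiniEvent): Unit = queue.enqueue(event)

  // Handles queued events one at a time, in FIFO order.
  def drain(): Unit =
    while (queue.nonEmpty) {
      val event = queue.dequeue()
      log += s"processing ${event.state}"
      event.process()
    }
}

object MiniLoopDemo {
  def run(): List[String] = {
    val loop = new MiniEventLoop

    // Mirrors the two steps of Startup.process(): register the watcher, then elect.
    object MiniStartup extends MiniEvent {
      val state: String = "ControllerChange"
      def process(): Unit = {
        loop.log += "register /controller watcher"
        loop.log += "elect()"
        ()
      }
    }

    loop.put(MiniStartup)
    loop.drain()
    loop.log.toList
  }
}
```

Because only one loop ever calls process(), the event bodies can touch shared controller state without extra locking, which is the main appeal of this design.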

The process method calls elect. The main flow of elect is as follows:


private def elect(): Unit = {
    // Read the id of the broker that currently hosts the controller; use -1 explicitly if there is none
    activeControllerId = zkClient.getControllerId.getOrElse(-1)
    // If a controller has already been elected, there is nothing to do
    if (activeControllerId != -1) {
      debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
      return
    }

    try {
      // Create the ephemeral node so that this broker takes part in the election
      val (epoch, epochZkVersion) = zkClient.registerControllerAndIncrementControllerEpoch(config.brokerId)
      controllerContext.epoch = epoch
      controllerContext.epochZkVersion = epochZkVersion
      activeControllerId = config.brokerId

      info(s"${config.brokerId} successfully elected as the controller. Epoch incremented to ${controllerContext.epoch} " +
        s"and epoch zk version is now ${controllerContext.epochZkVersion}")
      // Run the follow-up logic for the newly elected controller
      onControllerFailover()
    } catch {
      case e: ControllerMovedException =>
        maybeResign()

        if (activeControllerId != -1)
          debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}", e)
        else
          warn("A controller has been elected but just resigned, this will result in another round of election", e)

      case t: Throwable =>
        error(s"Error while electing or becoming controller on broker ${config.brokerId}. " +
          s"Trigger controller movement immediately", t)
        triggerControllerMove()
    }
  }

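elect first checks whether a controller already exists and returns early if so. Otherwise it tries to create the ephemeral /controller node in ZooKeeper. Only the broker that succeeds goes on to run onControllerFailover; a broker that loses the race gets a ControllerMovedException and ends up in the exception handler.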
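
The three possible outcomes of elect (a controller already exists, this broker wins, this broker loses the race) can be made explicit in a small model. This is a sketch under assumptions, not Kafka code: FakeZk, Outcome and ElectSketch are hypothetical names, IllegalStateException stands in for ControllerMovedException, and the observed controller id is passed in as a parameter so that a stale read can be simulated deterministically.

```scala
// In-memory stand-in for the ZooKeeper state touched during election (assumption).
final class FakeZk {
  private var controllerId: Option[Int] = None
  private var epoch: Int = 0

  def getControllerId: Option[Int] = synchronized(controllerId)

  // Registers brokerId as controller and bumps the epoch, or throws if the node is taken.
  def registerAndIncrementEpoch(brokerId: Int): Int = synchronized {
    if (controllerId.isDefined)
      throw new IllegalStateException(s"controller already registered: ${controllerId.get}")
    controllerId = Some(brokerId)
    epoch += 1
    epoch
  }

  def deleteController(): Unit = synchronized { controllerId = None }
}

sealed trait Outcome
case object AlreadyElected extends Outcome
final case class Won(epoch: Int) extends Outcome
case object LostRace extends Outcome

object ElectSketch {
  // observed is what this broker read from /controller before trying to register.
  def elect(brokerId: Int, zk: FakeZk, observed: Option[Int]): Outcome =
    observed match {
      case Some(_) => AlreadyElected                       // early return branch
      case None =>
        try Won(zk.registerAndIncrementEpoch(brokerId))    // success branch
        catch { case _: IllegalStateException => LostRace } // exception branch
    }

  def demo(): List[Outcome] = {
    val zk = new FakeZk
    val a = elect(1, zk, zk.getControllerId)   // no controller yet: broker 1 wins, epoch 1
    val b = elect(2, zk, zk.getControllerId)   // broker 2 sees broker 1 and stops
    zk.deleteController()                      // controller goes away
    val stale = zk.getControllerId             // broker 3 reads "no controller"...
    val c = elect(2, zk, zk.getControllerId)   // ...but broker 2 registers first, epoch 2
    val d = elect(3, zk, stale)                // broker 3 loses the race
    List(a, b, c, d)
  }
}
```

Calling ElectSketch.demo() yields Won(1), AlreadyElected, Won(2), LostRace, showing that the epoch grows by one with every successful election.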


private def onControllerFailover() {
    info("Registering handlers")
    
    // Register the ZooKeeper listeners
    val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
      isrChangeNotificationHandler)
    childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
    val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
    nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)

    info("Deleting log dir event notifications")
    zkClient.deleteLogDirEventNotifications(controllerContext.epochZkVersion)
    info("Deleting isr change notifications")
    zkClient.deleteIsrChangeNotifications(controllerContext.epochZkVersion)
    info("Initializing controller context")
    initializeControllerContext()
    info("Fetching topic deletions in progress")
    val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
    info("Initializing topic deletion manager")
    // Initialize the topic deletion manager
    topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)
    info("Sending update metadata request")
    // Send a request to update the cluster metadata
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set.empty)

    // Start the replica state machine
    replicaStateMachine.startup()
    // Start the partition state machine
    partitionStateMachine.startup()

    info(s"Ready to serve as the new controller with epoch $epoch")
    maybeTriggerPartitionReassignment(controllerContext.partitionsBeingReassigned.keySet)
    topicDeletionManager.tryTopicDeletion()
    val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
    onPreferredReplicaElection(pendingPreferredReplicaElections, ZkTriggered)
    info("Starting the controller scheduler")
    kafkaScheduler.startup()
    if (config.autoLeaderRebalanceEnable) {
      scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
    }

    if (config.tokenAuthEnabled) {
      info("starting the token expiry check scheduler")
      tokenCleanScheduler.startup()
      tokenCleanScheduler.schedule(name = "delete-expired-tokens",
        fun = () => tokenManager.expireTokens,
        period = config.delegationTokenExpiryCheckIntervalMs,
        unit = TimeUnit.MILLISECONDS)
    }
  }

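When the /controller node changes, each broker re-checks who the controller is. This is done in maybeResign, which makes the former controller step down if it has lost the role.

private def maybeResign(): Unit = {
    // Remember whether this broker was the active controller before the change
    val wasActiveBeforeChange = isActive
    // Re-register the ControllerChangeHandler listener
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    activeControllerId = zkClient.getControllerId.getOrElse(-1)
    if (wasActiveBeforeChange && !isActive) {
      // Run the resignation logic
      onControllerResignation()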
    }
  }
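
The before/after comparison in maybeResign is easy to get wrong, so here is a minimal sketch of just that decision. ResignSketch and ResignDemo are hypothetical names, the ZooKeeper read is replaced by a plain function, and a counter stands in for the call to onControllerResignation.

```scala
final class ResignSketch(brokerId: Int, readControllerId: () => Option[Int]) {
  private var activeControllerId: Int = -1
  var resignations: Int = 0

  def isActive: Boolean = activeControllerId == brokerId
  def becomeController(): Unit = activeControllerId = brokerId

  def maybeResign(): Unit = {
    val wasActiveBeforeChange = isActive                    // snapshot taken before refreshing
    activeControllerId = readControllerId().getOrElse(-1)   // refresh from the (fake) /controller node
    if (wasActiveBeforeChange && !isActive)
      resignations += 1                                     // stands in for onControllerResignation()
  }
}

object ResignDemo {
  def run(): Int = {
    var znode: Option[Int] = Some(1)            // fake content of /controller
    val broker1 = new ResignSketch(1, () => znode)
    broker1.becomeController()
    broker1.maybeResign()                       // still the controller: nothing happens
    znode = Some(2)                             // broker 2 has taken over
    broker1.maybeResign()                       // was active, no longer is: resign once
    broker1.maybeResign()                       // was not active before: nothing happens
    broker1.resignations
  }
}
```

Only the transition from "was the controller" to "is no longer the controller" triggers the resignation, so repeated notifications do not tear things down twice.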

onControllerResignation is responsible for the resignation logic.

private def onControllerResignation() {
    debug("Resigning")
    // De-register ZooKeeper listeners
    zkClient.unregisterZNodeChildChangeHandler(isrChangeNotificationHandler.path)
    zkClient.unregisterZNodeChangeHandler(partitionReassignmentHandler.path)
    zkClient.unregisterZNodeChangeHandler(preferredReplicaElectionHandler.path)
    zkClient.unregisterZNodeChildChangeHandler(logDirEventNotificationHandler.path)
    unregisterBrokerModificationsHandler(brokerModificationsHandlers.keySet)

    // Reset the topic deletion manager
    topicDeletionManager.reset()

    // Shut down the Kafka scheduler, which cancels the periodic leader rebalance
    kafkaScheduler.shutdown()
    // Reset all statistics fields to 0
    offlinePartitionCount = 0
    preferredReplicaImbalanceCount = 0
    globalTopicCount = 0
    globalPartitionCount = 0

    // Stop the token expiry check scheduler
    if (tokenCleanScheduler.isStarted)
      tokenCleanScheduler.shutdown()

    // De-register the partition ISR listeners of ongoing partition reassignments
    unregisterPartitionReassignmentIsrChangeHandlers()
    // Shut down the partition state machine
    partitionStateMachine.shutdown()
    // De-register the topic change listener
    zkClient.unregisterZNodeChildChangeHandler(topicChangeHandler.path)
    // De-register the partition modification listeners
    unregisterPartitionModificationsHandlers(partitionModificationsHandlers.keys.toSeq)
    // De-register the topic deletion listener
    zkClient.unregisterZNodeChildChangeHandler(topicDeletionHandler.path)
    // Shut down the replica state machine
    replicaStateMachine.shutdown()
    // De-register the broker change listener
    zkClient.unregisterZNodeChildChangeHandler(brokerChangeHandler.path)

    // Clear the controller metadata
    controllerContext.resetContext()

    info("Resigned")
  }
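
2. Controller responsibilities

The controller is a very important Kafka component. Its main responsibilities are cluster management and topic management. The rest of this section uses the source code to show how one part of cluster management, broker information management, is implemented.

Changes to broker information are detected by BrokerModificationsHandler, which puts a BrokerModifications event into eventManager whenever a broker's data changes.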

class BrokerModificationsHandler(controller: KafkaController, eventManager: ControllerEventManager, brokerId: Int) extends ZNodeChangeHandler {
  override val path: String = BrokerIdZNode.path(brokerId)

  override def handleDataChange(): Unit = {
    eventManager.put(controller.BrokerModifications(brokerId))
  }
}

The handling logic for a broker information change is in the process method of BrokerModifications.

case class BrokerModifications(brokerId: Int) extends ControllerEvent {
    override def state: ControllerState = ControllerState.BrokerChange

    override def process(): Unit = {
      if (!isActive) return
      // Fetch the target broker's details from ZooKeeper
      val newMetadata = zkClient.getBroker(brokerId)
      // Fetch the target broker's details from the cached metadata
      val oldMetadata = controllerContext.liveBrokers.find(_.id == brokerId)
      // If both exist and their endpoints differ, the broker's data has changed
      if (newMetadata.nonEmpty && oldMetadata.nonEmpty && newMetadata.map(_.endPoints) != oldMetadata.map(_.endPoints)) {
        info(s"Updated broker: ${newMetadata.get}")
        // Update the cached metadata
        controllerContext.updateBrokerMetadata(oldMetadata, newMetadata)
        // Propagate the new metadata to the other brokers
        onBrokerUpdate(brokerId)
      }
    }
  }

onBrokerUpdate then sends the refreshed metadata to every broker.

private def onBrokerUpdate(updatedBrokerId: Int) {
    info(s"Broker info update callback for $updatedBrokerId")
    // Send an UpdateMetadataRequest to all brokers in the cluster
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set.empty)
  }
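
The core of BrokerModifications is a guarded comparison: act only when both the cached and the fresh record exist and their endpoints differ. A minimal sketch of that check on plain immutable data follows. EndPoint, BrokerMeta and BrokerUpdateSketch are hypothetical names, and the returned Boolean stands in for the decision to call onBrokerUpdate.

```scala
final case class EndPoint(host: String, port: Int)
final case class BrokerMeta(id: Int, endPoints: Seq[EndPoint])

object BrokerUpdateSketch {
  // Returns the (possibly refreshed) cache and whether a metadata broadcast would be needed.
  def applyModification(
      cache: Map[Int, BrokerMeta],
      fromZk: Option[BrokerMeta],
      brokerId: Int
  ): (Map[Int, BrokerMeta], Boolean) =
    (fromZk, cache.get(brokerId)) match {
      // Both records exist and the endpoints differ: update the cache and broadcast.
      case (Some(fresh), Some(old)) if fresh.endPoints != old.endPoints =>
        (cache.updated(brokerId, fresh), true)
      // Missing on either side, or endpoints unchanged: leave everything as it is.
      case _ =>
        (cache, false)
    }
}
```

For example, with a cache holding broker 1 at host-a:9092, passing a fresh record with port 9093 returns an updated cache and true, while passing an identical record or None returns the original cache and false.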