Table of Contents

Cluster metadata: ControllerContext

     ControllerStats

     shuttingDownBrokerIds

     epoch & epochZkVersion

     liveBrokers

     liveBrokerEpochs

     allTopics

     partitionAssignments

     partitionLeadershipInfo

Request sending between the Controller and brokers: ControllerChannelManager

     Request types

     LeaderAndIsrRequest

     StopReplicaRequest

     UpdateMetadataRequest

     The request-sending thread (RequestSendThread)

     The request-sending class ControllerChannelManager

     startup()

     addBroker()

     sendRequest()

     ControllerChannelManager illustrated

ControllerEventManager (event handling)

     ControllerEventThread (single-threaded event processing model)

KafkaController

     Controller election scenarios

     Electing the Controller


In fact, the brokers in a cluster do not talk to ZooKeeper directly to fetch metadata. Instead, they always communicate with the Controller to obtain and update the latest cluster metadata. (Moreover, the community already plans to remove ZooKeeper entirely.) In this article we walk through one of Kafka's key components: the Controller.

 

Let's start with the Controller's data container class, ControllerContext.

Cluster metadata: ControllerContext

       All of the cluster's metadata is held in the ControllerContext class; it is essentially the data container of the Controller component.

class ControllerContext {
  val stats = new ControllerStats // Controller statistics
  var offlinePartitionCount = 0 // counter of offline partitions
  val shuttingDownBrokerIds = mutable.Set.empty[Int] // ids of brokers that are shutting down
  private val liveBrokers = mutable.Set.empty[Broker] // Broker objects currently running
  private val liveBrokerEpochs = mutable.Map.empty[Int, Long] // epoch of each running broker
  var epoch: Int = KafkaController.InitialControllerEpoch // current Controller epoch
  var epochZkVersion: Int = KafkaController.InitialControllerEpochZkVersion // dataVersion of the Controller epoch node in ZooKeeper
  val allTopics = mutable.Set.empty[String] // topics in the cluster
  val partitionAssignments = mutable.Map.empty[String, mutable.Map[Int, ReplicaAssignment]] // replica assignment of each topic partition
  val partitionLeadershipInfo = mutable.Map.empty[TopicPartition, LeaderIsrAndControllerEpoch] // leader/ISR info of each topic partition
  val partitionsBeingReassigned = mutable.Set.empty[TopicPartition] // topic partitions undergoing replica reassignment
  val partitionStates = mutable.Map.empty[TopicPartition, PartitionState] // state of each topic partition
  // ...

     ControllerStats   

Holds various statistics maintained by the Controller.

 

     shuttingDownBrokerIds   

This field holds the ids of all brokers that are currently shutting down; when managing the cluster, the Controller relies on it to tell whether a broker is in the process of shutting down.

 

     epoch & epochZkVersion

epoch is simply the value of the /controller_epoch node in ZooKeeper; you can think of it as the Controller's version number within the cluster.

epochZkVersion is the dataVersion of the /controller_epoch node. A newly elected Controller's epochZkVersion is larger than the old Controller's, so ZooKeeper operations issued during the old Controller's term can no longer succeed once the new Controller has taken over; this is what fences off an expired Controller.
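To make the fencing idea concrete, here is a minimal, self-contained sketch (not Kafka's actual code): controllerEpochNode and conditionalUpdate are hypothetical stand-ins for the /controller_epoch znode and ZooKeeper's conditional setData(path, data, expectedVersion). A writer that only knows a stale version number can no longer update the node.

//a sketch of version-based fencing; all names here are illustrative, not Kafka's
object EpochFencingSketch extends App {
  // hypothetical stand-in for the /controller_epoch znode: (epoch value, dataVersion)
  var controllerEpochNode: (Int, Int) = (1, 1)

  // conditional update, analogous to ZooKeeper's setData with an expected version
  def conditionalUpdate(newEpoch: Int, expectedVersion: Int): Boolean = synchronized {
    val (_, currentVersion) = controllerEpochNode
    if (currentVersion == expectedVersion) {
      controllerEpochNode = (newEpoch, currentVersion + 1) // write succeeds, dataVersion is bumped
      true
    } else false // stale version: the writer has been fenced
  }

  val staleVersion = controllerEpochNode._2
  println(conditionalUpdate(newEpoch = 2, expectedVersion = staleVersion)) // true: new Controller bumps the epoch
  println(conditionalUpdate(newEpoch = 3, expectedVersion = staleVersion)) // false: old Controller's write is rejected
}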

 

     liveBrokers

Holds the Broker objects currently running in the cluster; each Broker object is essentially a triple of <id, EndPoint, rack information>.

 

     liveBrokerEpochs

Holds the epoch of each running broker. The Controller uses these epochs to recognize stale broker generations (for example, a broker that bounced and re-registered), so that events and requests from an outdated broker are not acted upon.

 

     allTopics

Holds the names of all topics in the cluster. Whenever topics are created or deleted, the Controller updates this field.

 

     partitionAssignments

This field records the replica assignment of every topic partition (see the sketch after partitionLeadershipInfo below).

 

     partitionLeadershipInfo

Records, for each partition, which broker hosts the leader replica, what the current leader epoch is, which replicas are in the ISR set, and so on.
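A simplified sketch of how these two maps fit together (the types are illustrative stand-ins: a ReplicaAssignment is reduced to Seq[Int], LeaderIsrAndControllerEpoch to a tuple, and the topic "orders" is made up). partitionAssignments answers "which brokers host replicas of this partition?", while partitionLeadershipInfo answers "who leads it and what is the ISR?":

import scala.collection.mutable

object ControllerContextSketch extends App {
  // topic -> (partition -> replica broker ids); ReplicaAssignment reduced to Seq[Int]
  val partitionAssignments = mutable.Map.empty[String, mutable.Map[Int, Seq[Int]]]
  // (topic, partition) -> (leader broker id, leader epoch, ISR); LeaderIsrAndControllerEpoch reduced to a tuple
  val partitionLeadershipInfo = mutable.Map.empty[(String, Int), (Int, Int, Seq[Int])]

  // register partition 0 of a hypothetical topic "orders" on brokers 1, 2 and 3
  partitionAssignments.getOrElseUpdate("orders", mutable.Map.empty)(0) = Seq(1, 2, 3)
  partitionLeadershipInfo(("orders", 0)) = (1, 0, Seq(1, 2, 3)) // leader = 1, leaderEpoch = 0, ISR = all replicas

  println(partitionAssignments("orders")(0))      // which brokers hold replicas: List(1, 2, 3)
  println(partitionLeadershipInfo(("orders", 0))) // who leads and what the ISR is: (1,0,List(1, 2, 3))
}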

 

Next, how does the Controller manage sending requests to brokers? That is the job of the ControllerChannelManager.

Request sending between the Controller and brokers: ControllerChannelManager

    The Controller sends network requests to all brokers in the cluster (including the broker the Controller itself runs on).

    Currently, the Controller sends only three kinds of requests to brokers:

              1. LeaderAndIsrRequest

              2. StopReplicaRequest

              3. UpdateMetadataRequest

Request types

     LeaderAndIsrRequest

       Its main purpose is to tell a broker, for each partition of the relevant topics, which broker hosts the leader replica and which brokers host the ISR replicas.

 

     StopReplicaRequest

        Tells the target broker to stop the replicas it hosts; it is mainly used during partition replica reassignment and topic deletion.

 

     UpdateMetadataRequest

        This request updates the metadata cache on a broker. Every metadata change in the cluster first happens on the Controller and is then broadcast to all brokers via this request.

The request-sending thread (RequestSendThread)

         

       The Controller creates a dedicated RequestSendThread for every broker in the cluster; each of these threads continuously takes the pending requests for its broker from a blocking queue and sends them out.

//consumes QueueItem entries
class RequestSendThread(val controllerId: Int,    //id of the broker the Controller runs on
                        val controllerContext: ControllerContext, //Controller metadata
                        val queue: BlockingQueue[QueueItem],  //blocking queue of pending requests
                        val networkClient: NetworkClient, //network I/O client used to send requests
                        val brokerNode: Node, //target broker node
                        val config: KafkaConfig,//Kafka configuration
                        val time: Time,
                        val requestRateAndQueueTimeMetrics: Timer,
                        val stateChangeLogger: StateChangeLogger,
                        name: String)
  extends ShutdownableThread(name = name) {

As you can see, each RequestSendThread maintains a request blocking queue, BlockingQueue[QueueItem], so what it sends are QueueItem entries, and each QueueItem wraps exactly one of the three request types described above.

The send loop lives in the doWork() method.

Its job is to take a request from the blocking queue (blocking if the queue is empty), send it out, and then wait for the Response; while waiting for the Response the thread stays blocked, and once the Response arrives it invokes the callback that completes the request.
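Below is a minimal, hypothetical sketch of that loop under simplified assumptions (QueueItem here is a made-up case class and the network send is faked with a string), just to show the take-send-callback cycle of a per-broker sender thread:

import java.util.concurrent.LinkedBlockingQueue

object RequestSendThreadSketch extends App {
  // simplified stand-in for QueueItem: the request payload plus its completion callback
  final case class QueueItem(request: String, callback: String => Unit)

  val queue = new LinkedBlockingQueue[QueueItem]()

  val sender = new Thread(new Runnable {
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        val item = queue.take()                        // block until a request is available
        val response = s"response-to-${item.request}"  // pretend we sent it and received the Response
        item.callback(response)                        // run the completion callback
      }
    }
  }, "request-send-thread-broker-1")
  sender.setDaemon(true)
  sender.start()

  queue.put(QueueItem("LeaderAndIsrRequest", resp => println(s"handled: $resp")))
  Thread.sleep(200) // give the daemon thread time to process before the app exits
}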

 

The request-sending class ControllerChannelManager

        Its main job is to manage the connections between the Controller and the brokers, and to create a RequestSendThread instance for each broker.

        It puts the requests to be sent into the blocking queue of the target broker, where that broker's dedicated RequestSendThread picks them up.

//the channel manager the Controller uses to send requests to brokers
class ControllerChannelManager(controllerContext: ControllerContext, config: KafkaConfig, time: Time, metrics: Metrics,
                               stateChangeLogger: StateChangeLogger, threadNamePrefix: Option[String] = None) extends Logging with KafkaMetricsGroup {
  import ControllerChannelManager._
  protected val brokerStateInfo = new HashMap[Int, ControllerBrokerStateInfo]
....

A note on brokerStateInfo: its key is the id of a broker in the cluster, and its value is a ControllerBrokerStateInfo describing the Controller's connection state for that broker.

     startup() 

Called when the Controller component starts; it reads the broker list from the cluster metadata and then calls addBroker for each of those brokers.

     addBroker()   

Adds the target broker to brokerStateInfo and starts its RequestSendThread.

     sendRequest()

 Sends a request; in practice this just means putting the request object onto the target broker's request queue.
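Putting startup(), addBroker() and sendRequest() together, here is a hypothetical, simplified sketch of the pattern (BrokerState stands in for ControllerBrokerStateInfo, and the real network send is replaced by a println):

import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

object ChannelManagerSketch {
  final case class QueueItem(request: String)
  // simplified stand-in for ControllerBrokerStateInfo: the per-broker queue plus its sender thread
  final case class BrokerState(queue: LinkedBlockingQueue[QueueItem], thread: Thread)

  private val brokerStateInfo = mutable.Map.empty[Int, BrokerState]

  // startup(): create the per-broker state for every live broker, then start each sender thread
  def startup(liveBrokerIds: Seq[Int]): Unit = {
    liveBrokerIds.foreach(addBroker)
    brokerStateInfo.values.foreach(_.thread.start())
  }

  // addBroker(): register the target broker and create its dedicated sender thread
  def addBroker(brokerId: Int): Unit = {
    val queue = new LinkedBlockingQueue[QueueItem]()
    val thread = new Thread(new Runnable {
      override def run(): Unit =
        while (!Thread.currentThread().isInterrupted)
          println(s"sending ${queue.take().request} to broker $brokerId") // the real thread does network I/O here
    }, s"RequestSendThread-$brokerId")
    thread.setDaemon(true)
    brokerStateInfo(brokerId) = BrokerState(queue, thread)
  }

  // sendRequest(): just drop the request into the target broker's queue and return immediately
  def sendRequest(brokerId: Int, request: String): Unit =
    brokerStateInfo(brokerId).queue.put(QueueItem(request))

  def main(args: Array[String]): Unit = {
    startup(Seq(1, 2))
    sendRequest(1, "UpdateMetadataRequest")
    Thread.sleep(200) // let the daemon sender thread drain the queue before the JVM exits
  }
}

The point of this design is that sendRequest() never blocks on the network: it only enqueues, and each broker's own thread absorbs that broker's slowness without affecting the others.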

 

 

 

ControllerChannelManager illustrated

[figure: ControllerChannelManager and the per-broker request queues]

 

 

Next, the Controller relies on the ControllerEventManager to process events.

ControllerEventManager (event handling)

 

 

ControllerEventThread (single-threaded event processing model)

 

                       

[figure: the Controller's single-threaded event processing model]

       Take adding partitions as an example: the Controller has to create the partitions, update its own context, and propagate the change to the other brokers. Whether an event is triggered by a watcher, a scheduled task, or anything else, handling it means reading or updating that shared context, so concurrent handling would require synchronization. Simply guarding everything with locks would hurt performance, so the Controller uses a single-threaded model instead: every event is appended to a LinkedBlockingQueue, and a dedicated ControllerEventThread processes the events one by one in FIFO order.
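Here is a minimal sketch of that single-threaded model, assuming made-up event types (TopicCreated, BrokerDown) and a trivial piece of shared state: any thread may call put(), but only the controller-event-thread ever touches the state, so no locks are needed.

import java.util.concurrent.LinkedBlockingQueue

object EventManagerSketch extends App {
  sealed trait ControllerEvent
  final case class TopicCreated(name: String) extends ControllerEvent
  final case class BrokerDown(id: Int) extends ControllerEvent

  private val queue = new LinkedBlockingQueue[ControllerEvent]()
  private var allTopics = Set.empty[String] // shared state, touched only by the event thread

  private val eventThread = new Thread(new Runnable {
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        queue.take() match {                  // block until the next event, then handle events in FIFO order
          case TopicCreated(name) => allTopics += name; println(s"topics are now $allTopics")
          case BrokerDown(id)     => println(s"handle failure of broker $id")
        }
      }
    }
  }, "controller-event-thread")
  eventThread.setDaemon(true)
  eventThread.start()

  // callable from any thread: enqueueing is the only cross-thread interaction
  def put(event: ControllerEvent): Unit = queue.put(event)

  put(TopicCreated("orders"))
  put(BrokerDown(2))
  Thread.sleep(200) // let the daemon thread drain the queue before the app exits
}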

KafkaController

Controller election scenarios

               Three scenarios can trigger a Controller election:

              1. the cluster starts from scratch;

              2. a broker detects that the /controller node has disappeared;

              3. a broker detects that the data on the /controller node has changed.

Starting the cluster from scratch

When the cluster starts for the first time, no Controller has been elected yet. After a broker starts, it first writes the Startup ControllerEvent into the event queue, then starts the event-processing thread and the ControllerChangeHandler ZooKeeper watcher, and finally relies on the event-processing thread to carry out the election.

When a broker starts up, it calls the following method, which starts the ControllerEventThread.

//on first startup, the Startup event is written to the event queue, then the event-processing thread and the ControllerChangeHandler ZooKeeper watcher are started
  def startup() = {
    //register the ZK state-change handler, which watches for session expiration between the broker and ZooKeeper
    zkClient.registerStateChangeHandler(new StateChangeHandler {
      override val name: String = StateChangeHandlers.ControllerHandler
      override def afterInitializingSession(): Unit = {
        eventManager.put(RegisterBrokerAndReelect)
      }
      override def beforeInitializingSession(): Unit = {
        val expireEvent = new Expire
        eventManager.clearAndPut(expireEvent)

        // Block initialization of the new session until the expiration event is being handled,
        // which ensures that all pending events have been processed before creating the new session
        expireEvent.waitUntilProcessingStarted()
      }
    })
    //write the Startup event into the event queue
    eventManager.put(Startup)
    //start the ControllerEventThread, which begins processing ControllerEvents from the queue
    eventManager.start()
  }

The startup method registers a ZK state-change handler that watches whether the session between the broker and ZooKeeper has expired, then writes the Startup event into the event queue, and finally starts the ControllerEventThread, which begins processing the Startup event from the queue.

//handling of the Startup event
case object Startup extends ControllerEvent {

    def state = ControllerState.ControllerChange

    override def process(): Unit = {
      //register the ControllerChangeHandler ZooKeeper watcher
      zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
      //run the Controller election
      elect()
    }

  }

A broker detects that the /controller node has disappeared:

When a broker detects that the /controller node has disappeared, it means the cluster currently has no Controller. Therefore, every broker that detects the disappearance immediately calls the elect method to run for election.

The data on the /controller node changes:

If the broker was previously the Controller, it must first execute the resignation logic and then try to run for election again;

If the broker was not previously the Controller, it simply runs for the new Controller directly.

//resignation logic
  private def onControllerResignation() {
    debug("Resigning")
    // unregister the ZooKeeper watchers
    zkClient.unregisterZNodeChildChangeHandler(isrChangeNotificationHandler.path)
    zkClient.unregisterZNodeChangeHandler(partitionReassignmentHandler.path)
    zkClient.unregisterZNodeChangeHandler(preferredReplicaElectionHandler.path)
    zkClient.unregisterZNodeChildChangeHandler(logDirEventNotificationHandler.path)
    unregisterBrokerModificationsHandler(brokerModificationsHandlers.keySet)

    // reset topic deletion manager
    topicDeletionManager.reset()

    // shut down the Kafka scheduler, cancelling the periodic preferred-leader election
    kafkaScheduler.shutdown()
    //reset all statistics counters to 0
    offlinePartitionCount = 0
    preferredReplicaImbalanceCount = 0
    globalTopicCount = 0
    globalPartitionCount = 0

    // shut down the token-expiry check scheduler
    if (tokenCleanScheduler.isStarted)
      tokenCleanScheduler.shutdown()

    //unregister the partition-reassignment ISR-change watchers
    unregisterPartitionReassignmentIsrChangeHandlers()
    // shut down the partition state machine
    partitionStateMachine.shutdown()
    //unregister the topic-change watcher
    zkClient.unregisterZNodeChildChangeHandler(topicChangeHandler.path)
    //unregister the partition-modification watchers
    unregisterPartitionModificationsHandlers(partitionModificationsHandlers.keys.toSeq)
    //unregister the topic-deletion watcher
    zkClient.unregisterZNodeChildChangeHandler(topicDeletionHandler.path)
    // shutdown replica state machine
    replicaStateMachine.shutdown()
    zkClient.unregisterZNodeChildChangeHandler(brokerChangeHandler.path)
    //clear the cluster metadata
    controllerContext.resetContext()

    info("Resigned")
  }

Electing the Controller

//Controller election
  private def elect(): Unit = {
    val timestamp = time.milliseconds
    //get the id of the broker the current Controller runs on; -1 marks that no Controller exists
    activeControllerId = zkClient.getControllerId.getOrElse(-1)
    /*
     * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition,
     * it's possible that the controller has already been elected when we get here. This check will prevent the following
     * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
     */
    //if a Controller has already been elected, return
    if (activeControllerId != -1) {
      debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
      return
    }

    try {
      //create the /controller node in ZooKeeper
      zkClient.checkedEphemeralCreate(ControllerZNode.path, ControllerZNode.encode(config.brokerId, timestamp))
      info(s"${config.brokerId} successfully elected as the controller")
      //the broker that created the /controller node becomes the new Controller
      activeControllerId = config.brokerId
      //run the follow-up logic for becoming Controller
      onControllerFailover()
    } catch {
      case _: NodeExistsException =>
        // If someone else has written the path, then
        activeControllerId = zkClient.getControllerId.getOrElse(-1)

        if (activeControllerId != -1)
          debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}")
        else
          warn("A controller has been elected but just resigned, this will result in another round of election")

      case e2: Throwable =>
        error(s"Error while electing or becoming controller on broker ${config.brokerId}", e2)
        triggerControllerMove()
    }
  }

This method first checks whether a Controller has already been elected. Every broker in the cluster runs this same logic, so it is quite possible that by the time a broker reaches elect, a Controller has already been chosen; in that case there is nothing more to do. If not, the code tries to create the /controller node to grab the Controller role.

Once the registration succeeds, the broker calls the onControllerFailover method to perform the post-election work: registering the various ZK watchers, deleting log-dir change and ISR change notifications, starting the Controller channel manager, starting the replica and partition state machines, and so on.

//follow-up logic after being elected Controller
  private def onControllerFailover() {
    info("Reading controller epoch from ZooKeeper")
    readControllerEpochFromZooKeeper()
    info("Incrementing controller epoch in ZooKeeper")
    incrementControllerEpoch()
    info("Registering handlers")

    // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
    val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
      isrChangeNotificationHandler)
    childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
    val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
    //register the various ZooKeeper watchers
    nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)

    info("Deleting log dir event notifications")
    //delete log-dir change and ISR change notifications
    zkClient.deleteLogDirEventNotifications()
    info("Deleting isr change notifications")
    zkClient.deleteIsrChangeNotifications()
    info("Initializing controller context")
    //initialize the cluster metadata
    initializeControllerContext()
    info("Fetching topic deletions in progress")
    val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
    info("Initializing topic deletion manager")
    topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)

    // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
    // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
    // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
    // partitionStateMachine.startup().
    info("Sending update metadata request")
    //send metadata update requests to the brokers
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
    //start the replica state machine and the partition state machine
    replicaStateMachine.startup()
    partitionStateMachine.startup()

    info(s"Ready to serve as the new controller with epoch $epoch")
    maybeTriggerPartitionReassignment(controllerContext.partitionsBeingReassigned.keySet)
    topicDeletionManager.tryTopicDeletion()
    val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
    onPreferredReplicaElection(pendingPreferredReplicaElections)
    info("Starting the controller scheduler")
    kafkaScheduler.startup()
    if (config.autoLeaderRebalanceEnable) {
      scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
    }

    if (config.tokenAuthEnabled) {
      info("starting the token expiry check scheduler")
      tokenCleanScheduler.startup()
      tokenCleanScheduler.schedule(name = "delete-expired-tokens",
        fun = tokenManager.expireTokens,
        period = config.delegationTokenExpiryCheckIntervalMs,
        unit = TimeUnit.MILLISECONDS)
    }
  }