Table of Contents
Cluster metadata: ControllerContext
ControllerStats
shuttingDownBrokerIds
epoch & epochZkVersion
liveBrokers
liveBrokerEpochs
allTopics
partitionAssignments
partitionLeadershipInfo
Request sending between the Controller and brokers: ControllerChannelManager
Request types
LeaderAndIsrRequest
StopReplicaRequest
UpdateMetadataRequest
The request-sending thread (RequestSendThread)
The request-sending class: ControllerChannelManager
startup()
addBroker()
sendRequest()
ControllerChannelManager illustrated
ControllerEventManager (event processing)
ControllerEventThread (single-threaded event processing model)
KafkaController
Controller election scenarios
Electing the Controller
In fact, the brokers in a cluster do not interact with ZooKeeper directly to obtain metadata. Instead, they always communicate with the Controller to fetch and update the latest cluster data. (Moreover, the community already plans to remove ZooKeeper altogether.) In this article we walk through one of Kafka's key components: the Controller.
Let's start with the Controller's data container class, ControllerContext.
Cluster metadata: ControllerContext
All of the cluster's metadata is kept in the ControllerContext class; it is, in effect, the data container of the Controller component.
class ControllerContext {
  val stats = new ControllerStats // Controller statistics
  var offlinePartitionCount = 0 // counter of offline partitions
  val shuttingDownBrokerIds = mutable.Set.empty[Int] // ids of brokers that are shutting down
  private val liveBrokers = mutable.Set.empty[Broker] // currently live Broker objects
  private val liveBrokerEpochs = mutable.Map.empty[Int, Long] // epochs of the live brokers
  var epoch: Int = KafkaController.InitialControllerEpoch // current Controller epoch
  var epochZkVersion: Int = KafkaController.InitialControllerEpochZkVersion // dataVersion of the /controller_epoch ZooKeeper node
  val allTopics = mutable.Set.empty[String] // all topics in the cluster
  val partitionAssignments = mutable.Map.empty[String, mutable.Map[Int, ReplicaAssignment]] // replica assignment of each topic partition
  val partitionLeadershipInfo = mutable.Map.empty[TopicPartition, LeaderIsrAndControllerEpoch] // leader/ISR info of each topic partition
  val partitionsBeingReassigned = mutable.Set.empty[TopicPartition] // topic partitions currently undergoing replica reassignment
  val partitionStates = mutable.Map.empty[TopicPartition, PartitionState] // state of each topic partition
  // ...
ControllerStats
Holds various statistics about the Controller.
shuttingDownBrokerIds
This field holds the ids of all brokers that are currently shutting down. When the Controller manages the cluster's brokers, it relies on this field to determine whether a broker is in the middle of a shutdown.
epoch & epochZkVersion
epoch is simply the value of the /controller_epoch node in ZooKeeper; you can think of it as the Controller's version number within the cluster.
epochZkVersion is the dataVersion of the /controller_epoch node. Because a new Controller's epochZkVersion is larger than the old one's, any Controller operation started during the old Controller's term can no longer succeed once the new Controller has taken over; the stale Controller is effectively fenced off.
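To make the fencing idea concrete, here is a minimal sketch (with hypothetical ZkNode and conditionalUpdate stand-ins, not Kafka's real ZooKeeper client code) of how a conditional update keyed on the node's dataVersion rejects a write from a stale Controller:
object EpochFencingSketch extends App {
  // hypothetical stand-in for the /controller_epoch znode: its value plus the ZooKeeper dataVersion
  final case class ZkNode(data: Int, dataVersion: Int)

  // the update succeeds only if the caller's expected version matches the node's current dataVersion
  def conditionalUpdate(node: ZkNode, newData: Int, expectedVersion: Int): Either[String, ZkNode] =
    if (node.dataVersion == expectedVersion) Right(ZkNode(newData, node.dataVersion + 1))
    else Left(s"version mismatch: expected $expectedVersion, found ${node.dataVersion}")

  // after a new election the node's dataVersion has been bumped to 5
  val controllerEpochNode = ZkNode(data = 5, dataVersion = 5)
  // the old Controller still holds epochZkVersion = 4, so its write is fenced off
  println(conditionalUpdate(controllerEpochNode, newData = 6, expectedVersion = 4)) // Left(...)
  // the new Controller holds epochZkVersion = 5, so its write goes through
  println(conditionalUpdate(controllerEpochNode, newData = 6, expectedVersion = 5)) // Right(ZkNode(6,6))
}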
liveBrokers
Holds the currently live Broker objects; each Broker object is a triple of <id, EndPoint, rack info>.
liveBrokerEpochs
Holds the epoch of each live broker. It is used to prevent a very old (stale) broker from being elected Controller.
allTopics
Holds the names of all topics in the cluster. Whenever topics are created or deleted, the Controller updates this field.
partitionAssignments
This field holds the replica assignment of every topic partition.
partitionLeadershipInfo
Records which broker hosts each partition's leader replica, what the leader epoch is, which replicas are in the ISR, and so on.
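As a quick illustration of how partitionAssignments is organized (a simplified sketch, not the real class: ReplicaAssignment is reduced to a plain Seq[Int] of broker ids, and only the assignment lookup is shown):
import scala.collection.mutable

object ContextLookupSketch extends App {
  val partitionAssignments = mutable.Map.empty[String, mutable.Map[Int, Seq[Int]]]

  // topic "orders", partition 0, replicas on brokers 1, 2 and 3
  partitionAssignments.getOrElseUpdate("orders", mutable.Map.empty)(0) = Seq(1, 2, 3)

  // the kind of lookup the Controller does when it needs a partition's replica list
  val replicas = partitionAssignments.get("orders").flatMap(_.get(0)).getOrElse(Seq.empty)
  println(s"orders-0 replicas: ${replicas.mkString(",")}") // 1,2,3
}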
How does the Controller manage sending requests? That is the job of ControllerChannelManager, which we cover next.
Request sending between the Controller and brokers: ControllerChannelManager
The Controller sends network requests to every broker in the cluster (including the broker the Controller itself runs on).
Currently the Controller sends brokers only three kinds of requests:
1. LeaderAndIsrRequest
2. StopReplicaRequest
3. UpdateMetadataRequest
Request types
LeaderAndIsrRequest
Its main purpose is to tell a broker which broker hosts the leader replica of each partition of the relevant topics, and which brokers host the ISR replicas.
StopReplicaRequest
Tells the target broker to stop the replicas it hosts; used mainly for partition replica reassignment and topic deletion.
UpdateMetadataRequest
This request updates the metadata cache on brokers. Every metadata change in the cluster happens on the Controller first and is then broadcast to all brokers via this request.
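As a rough illustration of the per-partition information a LeaderAndIsrRequest carries (PartitionLeaderState is a hypothetical case class used here for illustration, not the real request type):
object LeaderAndIsrShapeSketch extends App {
  final case class PartitionLeaderState(topic: String, partition: Int, leader: Int, isr: List[Int])

  val state = PartitionLeaderState(topic = "orders", partition = 0, leader = 1, isr = List(1, 2, 3))
  println(s"${state.topic}-${state.partition}: leader broker=${state.leader}, isr=${state.isr.mkString(",")}")
}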
The request-sending thread (RequestSendThread)
The Controller creates a dedicated RequestSendThread for every broker in the cluster. Each of these threads continuously takes the pending requests for its target broker from a blocking queue and sends them out.
// consumes QueueItems
class RequestSendThread(val controllerId: Int, // id of the broker the Controller runs on
                        val controllerContext: ControllerContext, // Controller metadata
                        val queue: BlockingQueue[QueueItem], // blocking queue of pending requests
                        val networkClient: NetworkClient, // network I/O client used to send requests
                        val brokerNode: Node, // target broker node
                        val config: KafkaConfig, // Kafka configuration
                        val time: Time,
                        val requestRateAndQueueTimeMetrics: Timer,
                        val stateChangeLogger: StateChangeLogger,
                        name: String)
  extends ShutdownableThread(name = name) {
As you can see, each RequestSendThread holds a blocking queue of requests (BlockingQueue[QueueItem]), so what a RequestSendThread actually sends are QueueItem entries, and each QueueItem wraps one of the three request types above.
The sending logic lives in the doWork() method.
It takes a request from the blocking queue (blocking if the queue is empty), sends it out, and then waits for the response; the thread stays blocked while waiting. Once the response arrives, it invokes the callback that completes the request.
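Here is a minimal sketch of that doWork() flow, with simplified stand-ins (String requests and responses instead of the real QueueItem and NetworkClient types): take a request, send it and block until the response arrives, then run the completion callback.
import java.util.concurrent.LinkedBlockingQueue

object DoWorkSketch extends App {
  final case class QueueItem(request: String, callback: String => Unit)

  val queue = new LinkedBlockingQueue[QueueItem]()
  queue.put(QueueItem("LeaderAndIsrRequest", response => println(s"callback handled: $response")))

  // placeholder for the blocking network round trip to the target broker
  def sendAndReceive(request: String): String = s"response-for-$request"

  val item = queue.take()                      // blocks until a request is available
  val response = sendAndReceive(item.request)  // blocks until the broker responds
  item.callback(response)                      // invoke the request-completion callback
}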
The request-sending class: ControllerChannelManager
Its main job is to manage the connections between the Controller and the brokers and to create a RequestSendThread instance for each broker.
It also places each request to be sent into the blocking queue of the target broker, where it waits to be handled by that broker's dedicated RequestSendThread.
// channel manager the Controller uses to send requests to brokers
class ControllerChannelManager(controllerContext: ControllerContext, config: KafkaConfig, time: Time, metrics: Metrics,
                               stateChangeLogger: StateChangeLogger, threadNamePrefix: Option[String] = None) extends Logging with KafkaMetricsGroup {
  import ControllerChannelManager._
  protected val brokerStateInfo = new HashMap[Int, ControllerBrokerStateInfo]
  // ...
A note on brokerStateInfo: its key is a broker id in the cluster, and its value is a ControllerBrokerStateInfo.
startup()
Called when the Controller component starts; it reads the cluster's broker list from the metadata and then calls addBroker for each of them.
addBroker()
Adds the target broker to brokerStateInfo and starts its RequestSendThread.
sendRequest()
Sends a request, which in practice just means putting the request object into the target broker's request queue, as sketched below.
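A simplified sketch of that enqueue path (hypothetical types: a String request instead of the real request builders, and a plain map instead of brokerStateInfo). sendRequest only looks up the target broker's queue and enqueues; that broker's RequestSendThread performs the network I/O later.
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import scala.collection.mutable

object SendRequestSketch extends App {
  final case class QueueItem(request: String)

  val brokerQueues = mutable.Map.empty[Int, BlockingQueue[QueueItem]] // stands in for brokerStateInfo
  brokerQueues(1) = new LinkedBlockingQueue[QueueItem]()

  def sendRequest(brokerId: Int, request: String): Unit =
    brokerQueues.get(brokerId) match {
      case Some(queue) => queue.put(QueueItem(request)) // hand off to that broker's send thread
      case None        => println(s"broker $brokerId not registered, dropping $request")
    }

  sendRequest(1, "UpdateMetadataRequest")
  println(s"queued for broker 1: ${brokerQueues(1).size}") // 1
}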
ControllerChannelManager illustrated
Next, the Controller relies on ControllerEventManager to process events.
ControllerEventManager (event processing)
ControllerEventThread (single-threaded event processing model)
Take adding partitions as an example: the Controller has to create the partitions, update the context information, and propagate the change to the other nodes. Whether an event is triggered by a ZooKeeper listener, a scheduled task, or anything else, it needs to read or update the context, which raises the question of multi-thread synchronization.
Simply guarding everything with locks would hurt performance, so Kafka uses a single-threaded model instead: every event is queued in order into a LinkedBlockingQueue, and a dedicated thread, ControllerEventThread, processes the events one by one in FIFO order. A simplified sketch follows.
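Minimal sketch of that single-threaded event model (simplified event type, not the real ControllerEvent or ControllerEventThread classes): every event goes through one queue and is processed strictly in FIFO order by a single thread, so no locking is needed around the context.
import java.util.concurrent.LinkedBlockingQueue

object EventThreadSketch extends App {
  final case class ControllerEvent(name: String, process: () => Unit)

  val eventQueue = new LinkedBlockingQueue[ControllerEvent]()

  val eventThread = new Thread(new Runnable {
    override def run(): Unit =
      while (true) {
        val event = eventQueue.take() // blocks until an event arrives
        event.process()               // events run one at a time, in arrival order
      }
  }, "controller-event-thread")
  eventThread.setDaemon(true)
  eventThread.start()

  eventQueue.put(ControllerEvent("TopicChange", () => println("handle topic change")))
  eventQueue.put(ControllerEvent("BrokerChange", () => println("handle broker change")))
  Thread.sleep(200) // give the daemon thread time to drain the queue before this demo exits
}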
KafkaController
Controller election scenarios
Three scenarios can trigger a Controller election:
1. The cluster starts up from scratch;
2. A broker detects that the /controller node has disappeared;
3. A broker detects that the data of the /controller node has changed.
Cluster starting from scratch:
When the cluster starts for the first time, no Controller has been elected yet. After a broker starts, it first writes the Startup ControllerEvent into the event queue, then starts the event-processing thread and the ControllerChangeHandler ZooKeeper listener, and finally relies on the event-processing thread to carry out the election.
When a broker starts, it calls the following method to launch the ControllerEventThread.
// On first cluster startup, write the Startup event into the event queue, then start the
// event-processing thread and the ControllerChangeHandler ZooKeeper listener
def startup() = {
  // register the ZooKeeper state-change handler; it watches for expiration of the session between the broker and ZooKeeper
  zkClient.registerStateChangeHandler(new StateChangeHandler {
    override val name: String = StateChangeHandlers.ControllerHandler
    override def afterInitializingSession(): Unit = {
      eventManager.put(RegisterBrokerAndReelect)
    }
    override def beforeInitializingSession(): Unit = {
      val expireEvent = new Expire
      eventManager.clearAndPut(expireEvent)
      // Block initialization of the new session until the expiration event is being handled,
      // which ensures that all pending events have been processed before creating the new session
      expireEvent.waitUntilProcessingStarted()
    }
  })
  // put the Startup event into the event queue
  eventManager.put(Startup)
  // start the ControllerEventThread and begin processing ControllerEvents from the queue
  eventManager.start()
}
The startup method registers the ZooKeeper state-change handler, which watches whether the session between the broker and ZooKeeper has expired; it then writes the Startup event into the event queue and starts the ControllerEventThread, which begins processing the Startup event from the queue.
// handling of the Startup event
case object Startup extends ControllerEvent {
  def state = ControllerState.ControllerChange
  override def process(): Unit = {
    // register the ControllerChangeHandler ZooKeeper listener
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    // run the Controller election
    elect()
  }
}
When a broker detects that the /controller node has disappeared:
If a broker detects that the /controller node has disappeared, it means the cluster currently has no Controller. Therefore every broker that detects the disappearance immediately calls the elect method and runs the election logic, roughly as sketched below.
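Rough sketch of how the /controller watcher feeds the event queue (simplified: a bare queue and a plain event object instead of the real ControllerChangeHandler and ControllerEventManager).
import java.util.concurrent.LinkedBlockingQueue

object ControllerChangeSketch extends App {
  sealed trait ControllerEvent
  case object Reelect extends ControllerEvent // /controller was deleted: the broker re-runs elect()

  val eventQueue = new LinkedBlockingQueue[ControllerEvent]()

  // stand-in for the ZNode change callback fired when /controller is deleted
  def handleDeleted(): Unit = eventQueue.put(Reelect)

  handleDeleted()
  println(eventQueue.poll()) // Reelect
}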
When the data of the /controller node changes:
If the broker was previously the Controller, it must first resign and only then attempt to run again;
If the broker was not the Controller before, it simply runs for the new Controller role directly.
// resignation logic
private def onControllerResignation() {
  debug("Resigning")
  // unregister ZooKeeper listeners
  zkClient.unregisterZNodeChildChangeHandler(isrChangeNotificationHandler.path)
  zkClient.unregisterZNodeChangeHandler(partitionReassignmentHandler.path)
  zkClient.unregisterZNodeChangeHandler(preferredReplicaElectionHandler.path)
  zkClient.unregisterZNodeChildChangeHandler(logDirEventNotificationHandler.path)
  unregisterBrokerModificationsHandler(brokerModificationsHandlers.keySet)
  // reset topic deletion manager
  topicDeletionManager.reset()
  // shut down the Kafka scheduler, cancelling the periodic preferred-leader election
  kafkaScheduler.shutdown()
  // reset all statistics fields to 0
  offlinePartitionCount = 0
  preferredReplicaImbalanceCount = 0
  globalTopicCount = 0
  globalPartitionCount = 0
  // shut down the token expiry check scheduler
  if (tokenCleanScheduler.isStarted)
    tokenCleanScheduler.shutdown()
  // unregister the partition reassignment ISR-change listeners
  unregisterPartitionReassignmentIsrChangeHandlers()
  // shut down the partition state machine
  partitionStateMachine.shutdown()
  // unregister the topic change listener
  zkClient.unregisterZNodeChildChangeHandler(topicChangeHandler.path)
  // unregister the partition modification listeners
  unregisterPartitionModificationsHandlers(partitionModificationsHandlers.keys.toSeq)
  // unregister the topic deletion listener
  zkClient.unregisterZNodeChildChangeHandler(topicDeletionHandler.path)
  // shut down the replica state machine
  replicaStateMachine.shutdown()
  zkClient.unregisterZNodeChildChangeHandler(brokerChangeHandler.path)
  // clear the cluster metadata
  controllerContext.resetContext()
  info("Resigned")
}
Electing the Controller
// Controller election
private def elect(): Unit = {
  val timestamp = time.milliseconds
  // get the id of the broker the current Controller runs on; if no Controller exists, it is marked as -1
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  /*
   * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition,
   * it's possible that the controller has already been elected when we get here. This check will prevent the following
   * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
   */
  // if a Controller has already been elected, simply return
  if (activeControllerId != -1) {
    debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    return
  }
  try {
    // create the /controller node in ZooKeeper
    zkClient.checkedEphemeralCreate(ControllerZNode.path, ControllerZNode.encode(config.brokerId, timestamp))
    info(s"${config.brokerId} successfully elected as the controller")
    // the broker that created the /controller node becomes the new Controller
    activeControllerId = config.brokerId
    // run the follow-up logic for winning the election
    onControllerFailover()
  } catch {
    case _: NodeExistsException =>
      // If someone else has written the path, then
      activeControllerId = zkClient.getControllerId.getOrElse(-1)
      if (activeControllerId != -1)
        debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}")
      else
        warn("A controller has been elected but just resigned, this will result in another round of election")
    case e2: Throwable =>
      error(s"Error while electing or becoming controller on broker ${config.brokerId}", e2)
      triggerControllerMove()
  }
}
This method first checks whether a Controller has already been elected. Every broker in the cluster runs this logic, so it is quite possible that by the time a given broker executes elect, a Controller has already been chosen. If so, there is nothing more to do; if not, the code tries to create the /controller node to claim the Controller role.
Once the claim succeeds, onControllerFailover is called to perform the post-election work: registering the various ZooKeeper listeners, deleting the log-dir change and ISR change notifications, starting the Controller channel manager, starting the replica and partition state machines, and so on.
// follow-up logic after being elected Controller
private def onControllerFailover() {
  info("Reading controller epoch from ZooKeeper")
  readControllerEpochFromZooKeeper()
  info("Incrementing controller epoch in ZooKeeper")
  incrementControllerEpoch()
  info("Registering handlers")
  // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
  val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
    isrChangeNotificationHandler)
  childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
  val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
  // register the various ZooKeeper listeners
  nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)
  info("Deleting log dir event notifications")
  // delete log-dir event and ISR change notifications
  zkClient.deleteLogDirEventNotifications()
  info("Deleting isr change notifications")
  zkClient.deleteIsrChangeNotifications()
  info("Initializing controller context")
  // initialize the cluster metadata
  initializeControllerContext()
  info("Fetching topic deletions in progress")
  val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
  info("Initializing topic deletion manager")
  topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)
  // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
  // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
  // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
  // partitionStateMachine.startup().
  info("Sending update metadata request")
  // send the metadata update request to brokers
  sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
  // start the replica state machine and the partition state machine
  replicaStateMachine.startup()
  partitionStateMachine.startup()
  info(s"Ready to serve as the new controller with epoch $epoch")
  maybeTriggerPartitionReassignment(controllerContext.partitionsBeingReassigned.keySet)
  topicDeletionManager.tryTopicDeletion()
  val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
  onPreferredReplicaElection(pendingPreferredReplicaElections)
  info("Starting the controller scheduler")
  kafkaScheduler.startup()
  if (config.autoLeaderRebalanceEnable) {
    scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
  }
  if (config.tokenAuthEnabled) {
    info("starting the token expiry check scheduler")
    tokenCleanScheduler.startup()
    tokenCleanScheduler.schedule(name = "delete-expired-tokens",
      fun = tokenManager.expireTokens,
      period = config.delegationTokenExpiryCheckIntervalMs,
      unit = TimeUnit.MILLISECONDS)
  }
}