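As is well known, Kafka runs as a cluster. How, then, does Kafka keep data consistent, and how do the brokers in the cluster interact with each other and with consumers?
Let's start with a few terms:
AR: all replicas of a partition are collectively called the AR (Assigned Replicas).
ISR: the replicas that stay sufficiently in sync with the leader replica (including the leader itself) form the ISR (In-Sync Replicas). The ISR is a subset of the AR.
replica.lag.time.max.ms is the maximum time a follower replica may lag behind the leader replica; the default is 30 seconds (10 seconds before Kafka 2.5). A follower stays in the ISR as long as it keeps sending FetchRequests to the leader and has caught up to the leader's LEO at some point within the last 30 s. Otherwise it is treated as out of sync and removed from the ISR.
OSR: the replicas (excluding the leader) that lag too far behind the leader form the OSR (Out-of-Sync Replicas). It follows that AR = ISR + OSR.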
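The split of AR into ISR and OSR can be sketched as follows. This is a minimal illustration, not Kafka's actual classes: `ReplicaState`, `classify`, and the field `lastCaughtUpTimeMs` are names assumed here, and the rule used is simply "a follower is in sync if it was fully caught up within replica.lag.time.max.ms".

```scala
// Hypothetical model for illustration; these are not Kafka's real types.
object ReplicaSetsSketch {
  // A replica's broker id and the last time (ms) it was fully caught up to the leader's LEO.
  final case class ReplicaState(brokerId: Int, lastCaughtUpTimeMs: Long)

  /** Splits the AR into (ISR, OSR). The leader is always counted as in sync. */
  def classify(ar: Seq[ReplicaState],
               leaderId: Int,
               nowMs: Long,
               replicaLagTimeMaxMs: Long): (Set[Int], Set[Int]) = {
    val (in, out) = ar.partition { r =>
      r.brokerId == leaderId || nowMs - r.lastCaughtUpTimeMs <= replicaLagTimeMaxMs
    }
    (in.map(_.brokerId).toSet, out.map(_.brokerId).toSet)
  }
}
```

With a 30000 ms limit and the current time at 100000 ms, a follower last caught up at 95000 ms stays in the ISR, while one last caught up at 60000 ms falls into the OSR. Together the two sets always add up to the AR.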
Shrinking and expanding the ISR:
When Kafka starts, it launches two ISR-related scheduled tasks, named "isr-expiration" and "isr-change-propagation".
scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)
The isr-expiration task periodically checks whether each partition's ISR needs to shrink. Its period is half the value of replica.lag.time.max.ms. When it finds a failed replica in the ISR, it shrinks the ISR. Whenever a partition's ISR changes, the new state is written to the ZooKeeper node /brokers/topics/<topic>/partitions/<partition>/state. An example of the node data:
{"controller_epoch":26,"leader":0,"version":1,"leader_epoch":2,"isr":[0,1]}
object BrokersZNode {
  def path = "/brokers"
}
object TopicsZNode {
  def path = s"${BrokersZNode.path}/topics"
}
object TopicZNode {
  def path(topic: String) = s"${TopicsZNode.path}/$topic"
  def encode(assignment: collection.Map[TopicPartition, ReplicaAssignment]): Array[Byte] = {
    val replicaAssignmentJson = mutable.Map[String, util.List[Int]]()
    val addingReplicasAssignmentJson = mutable.Map[String, util.List[Int]]()
    val removingReplicasAssignmentJson = mutable.Map[String, util.List[Int]]()
    for ((partition, replicaAssignment) <- assignment) {
      replicaAssignmentJson += (partition.partition.toString -> replicaAssignment.replicas.asJava)
      if (replicaAssignment.addingReplicas.nonEmpty)
        addingReplicasAssignmentJson += (partition.partition.toString -> replicaAssignment.addingReplicas.asJava)
      if (replicaAssignment.removingReplicas.nonEmpty)
        removingReplicasAssignmentJson += (partition.partition.toString -> replicaAssignment.removingReplicas.asJava)
    }
    Json.encodeAsBytes(Map(
      "version" -> 2,
      "partitions" -> replicaAssignmentJson.asJava,
      "adding_replicas" -> addingReplicasAssignmentJson.asJava,
      "removing_replicas" -> removingReplicasAssignmentJson.asJava
    ).asJava)
  }
  def decode(topic: String, bytes: Array[Byte]): Map[TopicPartition, ReplicaAssignment] = {
    def getReplicas(replicasJsonOpt: Option[JsonObject], partition: String): Seq[Int] = {
      replicasJsonOpt match {
        case Some(replicasJson) => replicasJson.get(partition) match {
          case Some(ar) => ar.to[Seq[Int]]
          case None => Seq.empty[Int]
        }
        case None => Seq.empty[Int]
      }
    }
    Json.parseBytes(bytes).flatMap { js =>
      val assignmentJson = js.asJsonObject
      val partitionsJsonOpt = assignmentJson.get("partitions").map(_.asJsonObject)
      val addingReplicasJsonOpt = assignmentJson.get("adding_replicas").map(_.asJsonObject)
      val removingReplicasJsonOpt = assignmentJson.get("removing_replicas").map(_.asJsonObject)
      partitionsJsonOpt.map { partitionsJson =>
        partitionsJson.iterator.map { case (partition, replicas) =>
          new TopicPartition(topic, partition.toInt) -> ReplicaAssignment(
            replicas.to[Seq[Int]],
            getReplicas(addingReplicasJsonOpt, partition),
            getReplicas(removingReplicasJsonOpt, partition)
          )
        }
      }
    }.map(_.toMap).getOrElse(Map.empty)
  }
}
object TopicPartitionsZNode {
  def path(topic: String) = s"${TopicZNode.path(topic)}/partitions"
}
object TopicPartitionZNode {
  def path(partition: TopicPartition) = s"${TopicPartitionsZNode.path(partition.topic)}/${partition.partition}"
}
object TopicPartitionStateZNode {
  def path(partition: TopicPartition) = s"${TopicPartitionZNode.path(partition)}/state"
  def encode(leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch): Array[Byte] = {
    val leaderAndIsr = leaderIsrAndControllerEpoch.leaderAndIsr
    val controllerEpoch = leaderIsrAndControllerEpoch.controllerEpoch
    Json.encodeAsBytes(Map("version" -> 1, "leader" -> leaderAndIsr.leader, "leader_epoch" -> leaderAndIsr.leaderEpoch,
      "controller_epoch" -> controllerEpoch, "isr" -> leaderAndIsr.isr.asJava).asJava)
  }
  def decode(bytes: Array[Byte], stat: Stat): Option[LeaderIsrAndControllerEpoch] = {
    Json.parseBytes(bytes).map { js =>
      val leaderIsrAndEpochInfo = js.asJsonObject
      val leader = leaderIsrAndEpochInfo("leader").to[Int]
      val epoch = leaderIsrAndEpochInfo("leader_epoch").to[Int]
      val isr = leaderIsrAndEpochInfo("isr").to[List[Int]]
      val controllerEpoch = leaderIsrAndEpochInfo("controller_epoch").to[Int]
      val zkPathVersion = stat.getVersion
      LeaderIsrAndControllerEpoch(LeaderAndIsr(leader, epoch, isr, zkPathVersion), controllerEpoch)
    }
  }
}
// update ISR in zk and in cache
private[cluster] def shrinkIsr(newIsr: Set[Int]): Unit = {
  val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.toList, zkVersion)
  val zkVersionOpt = stateStore.shrinkIsr(controllerEpoch, newLeaderAndIsr)
  maybeUpdateIsrAndVersion(newIsr, zkVersionOpt)
}
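Here controller_epoch is the epoch of the current Kafka controller, leader is the id of the broker that hosts the partition's leader replica, version is the format version (currently fixed at 1), leader_epoch is the partition's leader epoch, and isr is the ISR list after the change.
In addition, when the ISR changes, the change is also cached in isrChangeSet. The isr-change-propagation task checks isrChangeSet periodically (every 2500 ms, a fixed value). If it finds ISR change records there, it creates a persistent sequential node whose name starts with isr_change_ under /isr_change_notification in ZooKeeper, for example: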
/isr_change_notification/isr_change_0000000000
object IsrChangeNotificationZNode {
  def path = "/isr_change_notification"
}
object IsrChangeNotificationSequenceZNode {
  val SequenceNumberPrefix = "isr_change_"
  def path(sequenceNumber: String = "") = s"${IsrChangeNotificationZNode.path}/$SequenceNumberPrefix$sequenceNumber"
  def encode(partitions: collection.Set[TopicPartition]): Array[Byte] = {
    val partitionsJson = partitions.map(partition => Map("topic" -> partition.topic, "partition" -> partition.partition).asJava)
    Json.encodeAsBytes(Map("version" -> IsrChangeNotificationHandler.Version, "partitions" -> partitionsJson.asJava).asJava)
  }
  def decode(bytes: Array[Byte]): Set[TopicPartition] = {
    Json.parseBytes(bytes).map { js =>
      val partitionsJson = js.asJsonObject("partitions").asJsonArray
      partitionsJson.iterator.map { partitionsJson =>
        val partitionJson = partitionsJson.asJsonObject
        val topic = partitionJson("topic").to[String]
        val partition = partitionJson("partition").to[Int]
        new TopicPartition(topic, partition)
      }
    }
  }.map(_.toSet).getOrElse(Set.empty)
  def sequenceNumber(path: String) = path.substring(path.lastIndexOf(SequenceNumberPrefix) + SequenceNumberPrefix.length)
}
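To make the node naming concrete, here is a small standalone sketch of how a name such as isr_change_0000000000 is formed and how the sequence number is pulled back out. ZooKeeper appends a 10-digit, zero-padded counter to sequential nodes; the object `IsrChangeNodeNameSketch` and the helper `nodeName` are assumptions made for this illustration, while `sequenceNumber` mirrors the one-liner in the source above.

```scala
// Standalone illustration; only the prefix and the substring logic come from the source above.
object IsrChangeNodeNameSketch {
  val Prefix = "isr_change_"

  // Builds the child name the way ZooKeeper names sequential nodes: prefix + 10-digit counter.
  def nodeName(seq: Int): String = f"$Prefix$seq%010d"

  // Extracts the zero-padded sequence number from a node name or full path.
  def sequenceNumber(path: String): String =
    path.substring(path.lastIndexOf(Prefix) + Prefix.length)
}
```

For example, `nodeName(0)` gives `isr_change_0000000000`, and `sequenceNumber("/isr_change_notification/isr_change_0000000042")` gives `0000000042`.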
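The contents of isrChangeSet are stored in the newly created isr_change_ node. The Kafka controller registers a Watcher on /isr_change_notification. When the children of this node change, the Watcher fires, the controller updates the relevant metadata, and it sends metadata update requests to the brokers it manages. Finally, the nodes under /isr_change_notification that have been processed are deleted. Firing the Watcher too often hurts the performance of the controller, ZooKeeper, and even the other brokers. To avoid this, Kafka throttles propagation: after detecting that a partition's ISR has changed, it propagates the change only when at least one of the following two conditions holds:
1. More than 5 seconds have passed since the last ISR change.
2. More than 60 seconds have passed since the last write to ZooKeeper.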
/**
 * Gets the isr change notifications as strings. These strings are the znode names and not the absolute znode path.
 * @return sequence of znode names and not the absolute znode path.
 */
def getAllIsrChangeNotifications: Seq[String] = {
  val getChildrenResponse = retryRequestUntilConnected(GetChildrenRequest(IsrChangeNotificationZNode.path, registerWatch = true))
  getChildrenResponse.resultCode match {
    case Code.OK => getChildrenResponse.children.map(IsrChangeNotificationSequenceZNode.sequenceNumber)
    case Code.NONODE => Seq.empty
    case _ => throw getChildrenResponse.resultException.get
  }
}
private val lastIsrChangeMs = new AtomicLong(System.currentTimeMillis())
private val lastIsrPropagationMs = new AtomicLong(System.currentTimeMillis())
val HighWatermarkFilename = "replication-offset-checkpoint"
val IsrChangePropagationBlackOut = 5000L
val IsrChangePropagationInterval = 60000L
def maybePropagateIsrChanges(): Unit = {
  val now = System.currentTimeMillis()
  isrChangeSet synchronized {
    if (isrChangeSet.nonEmpty &&
      (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
        lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
      zkClient.propagateIsrChanges(isrChangeSet)
      isrChangeSet.clear()
      lastIsrPropagationMs.set(now)
    }
  }
}
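The throttle condition in maybePropagateIsrChanges is easier to reason about as a pure function. The following is a sketch that copies only the boolean condition and the two constants from the source above; the object name `IsrPropagationSketch` and the function `shouldPropagate` are assumptions for illustration.

```scala
// Pure-function restatement of the propagation condition; names are hypothetical.
object IsrPropagationSketch {
  val BlackOutMs = 5000L   // same value as IsrChangePropagationBlackOut
  val IntervalMs = 60000L  // same value as IsrChangePropagationInterval

  // True when there are pending changes and either the ISR has been quiet for more
  // than 5 s, or nothing has been propagated for more than 60 s.
  def shouldPropagate(hasChanges: Boolean,
                      lastIsrChangeMs: Long,
                      lastPropagationMs: Long,
                      nowMs: Long): Boolean =
    hasChanges &&
      (lastIsrChangeMs + BlackOutMs < nowMs || lastPropagationMs + IntervalMs < nowMs)
}
```

So a partition whose ISR keeps flapping every second is still propagated at least about once a minute, while a single isolated change is propagated once the ISR has been stable for 5 seconds.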
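Where there is shrinking there is also expansion. So when does Kafka expand the ISR?
As a follower keeps replicating messages, its LEO moves forward and eventually catches up with the leader, at which point the follower qualifies to join the ISR. The criterion for having caught up is that the follower's LEO is not less than the leader's HW; note that the comparison is against the leader's HW, not its LEO. After the ISR expands, the /brokers/topics/<topic>/partitions/<partition>/state node in ZooKeeper and isrChangeSet are updated as well, and the remaining steps are the same as for shrinking.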
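The expansion rule can be written down in a few lines. This is a simplified sketch, assuming the only criterion is the LEO-versus-HW comparison described above; `IsrExpandSketch`, `isCaughtUp`, and `maybeExpand` are hypothetical names, not Kafka's API.

```scala
// Simplified illustration of the ISR expansion check; not Kafka's real implementation.
object IsrExpandSketch {
  // A follower counts as caught up once its LEO has reached the leader's HW.
  def isCaughtUp(followerLeo: Long, leaderHw: Long): Boolean =
    followerLeo >= leaderHw

  // Returns the ISR after considering one follower for admission.
  def maybeExpand(isr: Set[Int], followerId: Int, followerLeo: Long, leaderHw: Long): Set[Int] =
    if (!isr.contains(followerId) && isCaughtUp(followerLeo, leaderHw)) isr + followerId
    else isr
}
```

With a leader HW of 10, a follower at LEO 10 is admitted even if the leader's own LEO is already 15, whereas a follower at LEO 9 is not.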
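LEO and HW were mentioned above:
HW: the High Watermark. It marks a specific message offset; consumers can only fetch messages before this offset.
LEO: the Log End Offset. It marks the offset of the next message to be written to the log. The LEO equals the offset of the last message in the partition's log plus 1.
The ISR is closely tied to HW and LEO, as the following figure shows:
The HW is the smallest LEO in the ISR. If follower2 is removed from the ISR, the HW becomes 5.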
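The rule "HW is the smallest LEO in the ISR" can be sketched as below. The LEO values used in the example (leader 8, follower1 5, follower2 3) are assumed for illustration only, chosen so that removing follower2 yields an HW of 5 as in the text; `HighWatermarkSketch` and `highWatermark` are hypothetical names.

```scala
// Minimal illustration of HW = min(LEO of every ISR member); not Kafka's real code.
object HighWatermarkSketch {
  // leoByReplica maps a replica name to its LEO; isr must be non-empty
  // (in practice it always contains at least the leader).
  def highWatermark(leoByReplica: Map[String, Long], isr: Set[String]): Long =
    isr.map(leoByReplica).min
}
```

Under these assumed values, the HW is 3 while follower2 is in the ISR, and it moves up to 5 once follower2 is removed, because follower1 then has the smallest LEO.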
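Next, let's look at how a message sent by a producer gets stored:
So how exactly does it work?
1. The producer sends a request to append messages to certain partitions.
2. The ProduceRequest passes through the network layer and the API layer to the ReplicaManager, which hands the messages to the log storage system; they are finally appended to the corresponding log. At the same time, the DelayedFetch operations registered under the relevant keys in delayedFetchPurgatory are checked and completed if their conditions are met.
3. The log storage system returns the result of the append.
4. The ReplicaManager creates a DelayedProduce object for the ProduceRequest and hands it to delayedProducePurgatory.
5. delayedProducePurgatory uses a SystemTimer to track whether the DelayedProduce has timed out.
6. Followers in the ISR send FetchRequests to replicate messages from the leader; each fetch also checks whether the DelayedProduce can now be completed.
7. When the DelayedProduce completes, it invokes a callback that builds the ProduceResponse and adds it to the RequestChannel.
8. The network layer returns the ProduceResponse to the client.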
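Steps 4 to 7 can be pictured with a small state check. This is a simplified sketch that assumes acks=-1 and a single partition, where the pending produce completes once the leader's HW has reached the offset just past the appended messages, or fails when its deadline passes. `PendingProduce`, `Outcome`, and `check` are hypothetical names, not Kafka's DelayedProduce API.

```scala
// Simplified model of a delayed produce for one partition; names are hypothetical.
object DelayedProduceSketch {
  // requiredOffset: the offset right after the last appended message.
  final case class PendingProduce(requiredOffset: Long, deadlineMs: Long)

  sealed trait Outcome
  case object Completed extends Outcome    // followers in the ISR have replicated the messages
  case object TimedOut extends Outcome     // the timer fired first (step 5)
  case object StillWaiting extends Outcome // keep waiting for more follower fetches (step 6)

  def check(p: PendingProduce, leaderHw: Long, nowMs: Long): Outcome =
    if (leaderHw >= p.requiredOffset) Completed
    else if (nowMs >= p.deadlineMs) TimedOut
    else StillWaiting
}
```

Each follower fetch may advance the leader's HW, which is why a fetch triggers a re-check of the pending produce; the timer only matters if the HW never gets there in time.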
What we care about here are steps 5 and 6: how does the leader let the followers know?
def appendRecordsToLeader(records: MemoryRecords, origin: AppendOrigin, requiredAcks: Int): LogAppendInfo = {
  val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
    leaderLogIfLocal match {
      case Some(leaderLog) =>
        val minIsr = leaderLog.config.minInSyncReplicas
        val inSyncSize = inSyncReplicaIds.size
        // Avoid writing to leader if there are not enough insync replicas to make it safe
        if (inSyncSize < minIsr && requiredAcks == -1) {
          throw new NotEnoughReplicasException(s"The size of the current ISR $inSyncReplicaIds " +
            s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
        }
        val info = leaderLog.appendAsLeader(records, leaderEpoch = this.leaderEpoch, origin,
          interBrokerProtocolVersion)
        // we may need to increment the high watermark since the ISR could be down to 1
        (info, maybeIncrementLeaderHW(leaderLog))
      case None =>
        throw new NotLeaderOrFollowerException("Leader not local for partition %s on broker %d"
          .format(topicPartition, localBrokerId))
    }
  }
  // some delayed operations may be unblocked after HW changed
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
  else {
    // probably unblock some follower fetch requests since log end offset has been updated
    delayedOperations.checkAndCompleteFetch()
  }
  info
}
/**
 * Check if some delayed operations can be completed with the given watch key,
 * and if yes complete them.
 *
 * @return the number of completed operations during this process
 */
def checkAndComplete(key: Any): Int = {
  val wl = watcherList(key)
  val watchers = inLock(wl.watchersLock) { wl.watchersByKey.get(key) }
  val numCompleted = if (watchers == null)
    0
  else
    watchers.tryCompleteWatched()
  debug(s"Request key $key unblocked $numCompleted $purgatoryName operations")
  numCompleted
}
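This completes the delayed fetch requests that the followers have pending on the leader, which is how they are notified of the new data. On the follower side, the processing logic is: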
override def run(): Unit = {
  isStarted = true
  info("Starting")
  try {
    while (isRunning)
      doWork()
  } catch {
    case e: FatalExitError =>
      shutdownInitiated.countDown()
      shutdownComplete.countDown()
      info("Stopped")
      Exit.exit(e.statusCode())
    case e: Throwable =>
      if (isRunning)
        error("Error due to", e)
  } finally {
    shutdownComplete.countDown()
  }
  info("Stopped")
}
override def doWork(): Unit = {
  maybeTruncate()
  maybeFetch()
}
private def maybeFetch(): Unit = {
  val fetchRequestOpt = inLock(partitionMapLock) {
    val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(partitionStates.partitionStateMap.asScala)
    handlePartitionsWithErrors(partitionsWithError, "maybeFetch")
    if (fetchRequestOpt.isEmpty) {
      trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
      partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
    }
    fetchRequestOpt
  }
  fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
    processFetchRequest(sessionPartitions, fetchRequest)
  }
}
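LSO: short for LastStableOffset. It is tied to Kafka transactions.
The consumer parameter isolation.level configures the consumer's transaction isolation level. It is a string, either "read_uncommitted" or "read_committed", and determines how far the consumer can read. With "read_committed", the consumer ignores messages from uncommitted transactions, that is, it can only consume up to the LSO (LastStableOffset). The default, "read_uncommitted", lets it consume up to the HW (High Watermark).
Note: follower replicas also use the "read_uncommitted" isolation level, and this cannot be changed.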
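As a configuration example, the isolation level is set like any other consumer property. The sketch below only builds a java.util.Properties object; the broker address and group id are placeholder values assumed for illustration, and `ConsumerIsolationSketch` and `consumerProps` are hypothetical names.

```scala
import java.util.Properties

// Builds consumer properties with the chosen isolation level; placeholder values are marked.
object ConsumerIsolationSketch {
  def consumerProps(readCommitted: Boolean): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder address
    props.setProperty("group.id", "demo-group")              // placeholder group id
    props.setProperty("isolation.level",
      if (readCommitted) "read_committed" else "read_uncommitted")
    props
  }
}
```

A real consumer would also need deserializer settings before these properties could be passed to a KafkaConsumer; they are left out here to keep the focus on isolation.level.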
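With Kafka transactions enabled, suppose a producer sends several messages (msg1, msg2, ...) to the broker. Until the producer commits the transaction (commitTransaction), consumers with isolation.level=read_committed cannot see these messages, while consumers with isolation.level=read_uncommitted can. The position of the first message in the transaction is marked as the firstUnstableOffset (that is, the position of msg1).
For each partition, Lag equals HW - ConsumerOffset, where ConsumerOffset is the current consumed position. That holds for the ordinary case; once transactions are involved, Lag is computed differently.
For an unfinished transaction, the LSO equals the position of the first message in the transaction (firstUnstableOffset).
For a finished transaction, the LSO equals the HW. So we can conclude: LSO ≤ HW ≤ LEO.
When a partition has an unfinished transaction and the consumer's isolation.level is "read_committed", the Lag equals LSO - ConsumerOffset.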
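The two Lag formulas differ only in which upper bound the consumer can reach. A minimal sketch, with `ConsumerLagSketch` and `lag` as hypothetical names:

```scala
// Lag as described above: measured against the LSO for read_committed, else against the HW.
object ConsumerLagSketch {
  def lag(hw: Long, lso: Long, consumerOffset: Long, readCommitted: Boolean): Long = {
    val readableEnd = if (readCommitted) lso else hw
    readableEnd - consumerOffset
  }
}
```

For instance, with HW = 100, LSO = 80 (an open transaction starts at offset 80), and a consumer at offset 70, the Lag is 10 under read_committed and 30 under read_uncommitted. Once the transaction finishes and LSO equals HW, both give the same value.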
OK, that covers the key points on the distributed side.