Table of Contents

  • 6. Kafka Log Storage Internals
  • 1. Code Walkthrough
  • 1. The Producer Sends a ProduceRequest
  • i. ProduceRequest structure
  • ii. Building the ProduceRequest
  • 2. The Broker Receives the Produce Request
  • i. Handling the ProduceRequest
  • ii. Appending the log to the file channel
  • iii. Flushing the channel to disk
  • 2. Storage Principles
  • 1. Segment structure
  • i. Structure of the .log file
  • ii. Inspecting the .log file
  • iii. Inspecting the .index file
  • 2. How a given message is located by offset


6. Kafka Log Storage Internals

In 「kafka源码分析」kafka基础知识介绍 (the Kafka basics installment of this source-code series) we saw how Kafka batches messages on the producer side. But where do those messages end up once they are sent? How are they stored, and how are they read back? This chapter picks up the story at the point where a message is sent to the broker. As a quick recap, the producer send path is:

1. Fetch the topic's metadata and check that the topic is available
2. Serialize the key and value
3. Determine the partition (you can override partition(); by default a hash-mod scheme is used)
4. Once batch.size is reached, wake up the sender thread to send the RecordBatch
5. Send the RecordBatch (a producer sketch follows below)
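To make steps 3-5 concrete, here is a minimal producer sketch. The topic name, bootstrap address, and the batch.size / linger.ms values are all hypothetical; the point is only that records accumulate into a RecordBatch until batch.size (or linger.ms) triggers the sender thread.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("batch.size", "16384") // a RecordBatch is sent once it reaches this many bytes...
  props.put("linger.ms", "5")      // ...or after this delay, whichever comes first
  props.put("acks", "-1")          // wait for the full ISR (see the Acks field in the schema below)

  val producer = new KafkaProducer[String, String](props)
  // records destined for the same partition are accumulated into a single RecordBatch
  producer.send(new ProducerRecord[String, String]("test-topic", "key-1", "value-1"))
  producer.close() // flushes any batches that are still buffered
}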

1. Code Walkthrough

1. The Producer Sends a ProduceRequest

i. ProduceRequest structure
{
    "apiKey":0,//表示PRODUCE
    "type":"request",
    "listeners":["zkBroker","broker"],
    "name":"ProduceRequest",
    "validVersions":"0-9",
    "flexibleVersions":"9+",
    "fields":[
        {
            "name":"TransactionalId",
            "type":"string",
            "versions":"3+",
            "nullableVersions":"3+",
            "default":"null",
            "entityType":"transactionalId",
            "about":"The transactional ID, or null if the producer is not transactional."
        },
        {
            "name":"Acks",
            "type":"int16",
            "versions":"0+",
            "about":"The number of acknowledgments the producer requires the leader to have received before considering a request complete. Allowed values: 0 for no acknowledgments, 1 for only the leader and -1 for the full ISR."
        },
        {
            "name":"TimeoutMs",
            "type":"int32",
            "versions":"0+",
            "about":"The timeout to await a response in miliseconds."
        },
        {
            "name":"TopicData",
            "type":"[]TopicProduceData",
            "versions":"0+",
            "about":"Each topic to produce to.",
            "fields":[
                {
                    "name":"Name",
                    "type":"string",
                    "versions":"0+",
                    "entityType":"topicName",
                    "mapKey":true,
                    "about":"The topic name."
                },
                {
                    "name":"PartitionData",
                    "type":"[]PartitionProduceData",
                    "versions":"0+",
                    "about":"Each partition to produce to.",
                    "fields":[
                        {
                            "name":"Index",
                            "type":"int32",
                            "versions":"0+",
                            "about":"The partition index."
                        },
                        {
                            "name":"Records",
                            "type":"records",
                            "versions":"0+",
                            "nullableVersions":"0+",
                            "about":"The record data to be produced."
                        }
                    ]
                }
            ]
        }
    ]
}

The producer fills in this schema with whatever it needs to send; the request is built in exactly this format.

ii. Building the ProduceRequest


As the schema shows, the payload that actually gets sent lives in TopicData.PartitionData.Records, and the handling here is quite simple.
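As a rough sketch of how the client-side Sender assembles that structure (the class and setter names below follow the auto-generated ProduceRequestData message classes in recent clients; treat the exact calls as an assumption, not a verbatim copy of the Sender code):

import org.apache.kafka.common.message.ProduceRequestData
import org.apache.kafka.common.record.MemoryRecords

// memoryRecords: the batched records drained from the RecordAccumulator (assumed given here)
def buildRequestData(topic: String, partition: Int, memoryRecords: MemoryRecords): ProduceRequestData = {
  val partitionData = new ProduceRequestData.PartitionProduceData()
    .setIndex(partition)       // PartitionData.Index
    .setRecords(memoryRecords) // PartitionData.Records -- the actual payload

  val topicData = new ProduceRequestData.TopicProduceData()
    .setName(topic)            // TopicData.Name
    .setPartitionData(java.util.Collections.singletonList(partitionData))

  val topics = new ProduceRequestData.TopicProduceDataCollection()
  topics.add(topicData)

  val acks: Short = -1         // Acks field: -1 = wait for the full ISR
  new ProduceRequestData()
    .setAcks(acks)
    .setTimeoutMs(30000)       // TimeoutMs field
    .setTopicData(topics)
}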

2. The Broker Receives the Produce Request


Let's look at how the server handles the request once it arrives.

i. Handling the ProduceRequest
def handleProduceRequest(request: RequestChannel.Request): Unit = {
  val produceRequest = request.body[ProduceRequest]
  val requestSize = request.sizeInBytes
  // ... omitted: if idempotence / transactions are enabled, the corresponding validation is performed here.
    
  val unauthorizedTopicResponses = mutable.Map[TopicPartition, PartitionResponse]()
  val nonExistingTopicResponses = mutable.Map[TopicPartition, PartitionResponse]()
  val invalidRequestResponses = mutable.Map[TopicPartition, PartitionResponse]()
  val authorizedRequestInfo = mutable.Map[TopicPartition, MemoryRecords]()
  // cache the result to avoid redundant authorization calls
  val authorizedTopics = filterByAuthorized(request.context, WRITE, TOPIC,
    produceRequest.data().topicData().asScala)(_.name())

  produceRequest.data.topicData.forEach(topic => topic.partitionData.forEach { partition =>
    val topicPartition = new TopicPartition(topic.name, partition.index)
    // This caller assumes the type is MemoryRecords and that is true on current serialization
    // We cast the type to avoid causing big change to code base.
    // https://issues.apache.org/jira/browse/KAFKA-10698
    val memoryRecords = partition.records.asInstanceOf[MemoryRecords]
    if (!authorizedTopics.contains(topicPartition.topic))
      // this topic is not authorized for WRITE
      unauthorizedTopicResponses += topicPartition -> new PartitionResponse(Errors.TOPIC_AUTHORIZATION_FAILED)
    else if (!metadataCache.contains(topicPartition))
      nonExistingTopicResponses += topicPartition -> new PartitionResponse(Errors.UNKNOWN_TOPIC_OR_PARTITION)
    else
      try {
        ProduceRequest.validateRecords(request.header.apiVersion, memoryRecords)
        authorizedRequestInfo += (topicPartition -> memoryRecords)
      } catch {
        case e: ApiException =>
          invalidRequestResponses += topicPartition -> new PartitionResponse(Errors.forException(e))
      }
  })

  // the callback for sending a produce response
  // The construction of ProduceResponse is able to accept auto-generated protocol data so
  // KafkaApis#handleProduceRequest should apply auto-generated protocol to avoid extra conversion.
  // https://issues.apache.org/jira/browse/KAFKA-10730
  @nowarn("cat=deprecation")
  def sendResponseCallback(responseStatus: Map[TopicPartition, PartitionResponse]): Unit = {
    val mergedResponseStatus = responseStatus ++ unauthorizedTopicResponses ++ nonExistingTopicResponses ++ invalidRequestResponses
    var errorInResponse = false

    mergedResponseStatus.forKeyValue { (topicPartition, status) =>
      if (status.error != Errors.NONE) {
        errorInResponse = true
        debug("Produce request with correlation id %d from client %s on partition %s failed due to %s".format(
          request.header.correlationId,
          request.header.clientId,
          topicPartition,
          status.error.exceptionName))
      }
    }

    // Record both bandwidth and request quota-specific values and throttle by muting the channel if any of the quotas
    // have been violated. If both quotas have been violated, use the max throttle time between the two quotas. Note
    // that the request quota is not enforced if acks == 0.
    val timeMs = time.milliseconds()
    val bandwidthThrottleTimeMs = quotas.produce.maybeRecordAndGetThrottleTimeMs(request, requestSize, timeMs)
    val requestThrottleTimeMs =
      if (produceRequest.acks == 0) 0
      else quotas.request.maybeRecordAndGetThrottleTimeMs(request, timeMs)
    val maxThrottleTimeMs = Math.max(bandwidthThrottleTimeMs, requestThrottleTimeMs)
    if (maxThrottleTimeMs > 0) {
      request.apiThrottleTimeMs = maxThrottleTimeMs
      if (bandwidthThrottleTimeMs > requestThrottleTimeMs) {
        quotas.produce.throttle(request, bandwidthThrottleTimeMs, requestChannel.sendResponse)
      } else {
        quotas.request.throttle(request, requestThrottleTimeMs, requestChannel.sendResponse)
      }
    }

    // Send the response immediately. In case of throttling, the channel has already been muted.
    if (produceRequest.acks == 0) {
      // no operation needed if producer request.required.acks = 0; however, if there is any error in handling
      // the request, since no response is expected by the producer, the server will close socket server so that
      // the producer client will know that some error has happened and will refresh its metadata
      if (errorInResponse) {
        val exceptionsSummary = mergedResponseStatus.map { case (topicPartition, status) =>
          topicPartition -> status.error.exceptionName
        }.mkString(", ")
        info(
          s"Closing connection due to error during produce request with correlation id ${request.header.correlationId} " +
            s"from client id ${request.header.clientId} with ack=0\n" +
            s"Topic and partition to exceptions: $exceptionsSummary"
        )
        closeConnection(request, new ProduceResponse(mergedResponseStatus.asJava).errorCounts)
      } else {
        // Note that although request throttling is exempt for acks == 0, the channel may be throttled due to
        // bandwidth quota violation.
        sendNoOpResponseExemptThrottle(request)
      }
    } else {
      sendResponse(request, Some(new ProduceResponse(mergedResponseStatus.asJava, maxThrottleTimeMs)), None)
    }
  }

  def processingStatsCallback(processingStats: FetchResponseStats): Unit = {
    processingStats.forKeyValue { (tp, info) =>
      updateRecordConversionStats(request, tp, info)
    }
  }

  if (authorizedRequestInfo.isEmpty)
    sendResponseCallback(Map.empty)
  else {
    val internalTopicsAllowed = request.header.clientId == AdminUtils.AdminClientId

    // call the replica manager to append messages to the replicas
    replicaManager.appendRecords(
      timeout = produceRequest.timeout.toLong,
      requiredAcks = produceRequest.acks,
      internalTopicsAllowed = internalTopicsAllowed,
      origin = AppendOrigin.Client,
      entriesPerPartition = authorizedRequestInfo,
      responseCallback = sendResponseCallback,
      recordConversionStatsCallback = processingStatsCallback)

    // if the request is put into the purgatory, it will have a held reference and hence cannot be garbage collected;
    // hence we clear its data here in order to let GC reclaim its memory since it is already appended to log
    produceRequest.clearPartitionRecords()
  }
}

Don't worry if you can't follow every line above; in essence it is doing validation:

1. Does the client have WRITE permission on the topic?

2. If apiVersion >= 3, the records must be an instance of Records whose batches have magic == 2, and if apiVersion < 7 the compression type must not be ZSTD.

If the checks pass, replicaManager#appendRecords() is called.
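For reference, here is a condensed paraphrase of what ProduceRequest.validateRecords checks (a sketch based on the broker code, not a verbatim copy):

import org.apache.kafka.common.InvalidRecordException
import org.apache.kafka.common.errors.UnsupportedCompressionTypeException
import org.apache.kafka.common.record.{BaseRecords, CompressionType, RecordBatch, Records}

// Paraphrased validation: only applies to apiVersion >= 3
def validateRecords(version: Short, baseRecords: BaseRecords): Unit = {
  if (version >= 3) {
    baseRecords match {
      case records: Records =>
        val iterator = records.batches.iterator
        if (!iterator.hasNext)
          throw new InvalidRecordException("the request must contain at least one record batch")
        val entry = iterator.next()
        if (entry.magic != RecordBatch.MAGIC_VALUE_V2)
          throw new InvalidRecordException("only record batches with magic version 2 are allowed")
        if (version < 7 && entry.compressionType == CompressionType.ZSTD)
          throw new UnsupportedCompressionTypeException("ZStandard compression requires apiVersion >= 7")
      case _ => // nothing to validate for other BaseRecords implementations
    }
  }
}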

ii. Appending the log to the file channel
replicaManager.appendRecords(
        timeout = produceRequest.timeout.toLong,
        requiredAcks = produceRequest.acks,
        internalTopicsAllowed = internalTopicsAllowed,
        origin = AppendOrigin.Client,
        entriesPerPartition = authorizedRequestInfo,
        responseCallback = sendResponseCallback,
        recordConversionStatsCallback = processingStatsCallback)


Now we are finally getting to the heart of the matter: appendRecords eventually delegates to the partition leader's append path.


The minIsr seen in that path comes from the min.insync.replicas setting; if you want a mind map of the latest Kafka configuration options, see the Kafka configuration mind-map chapter.

One thing to note here: when the producer sets acks to -1 (all), min.insync.replicas specifies the minimum number of in-sync replicas that must acknowledge the write for it to be considered successful. Used together, acks and min.insync.replicas let you enforce stronger durability guarantees. A sketch of the check follows.
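A toy version of the check performed in Partition.appendRecordsToLeader (simplified; the real code reads the ISR state and the log config directly):

import org.apache.kafka.common.errors.NotEnoughReplicasException

// When acks == -1, refuse the write up front if the ISR has already shrunk below min.insync.replicas.
def checkEnoughReplicas(inSyncReplicaCount: Int, minIsr: Int, requiredAcks: Short): Unit = {
  if (requiredAcks == -1 && inSyncReplicaCount < minIsr)
    throw new NotEnoughReplicasException(
      s"ISR size $inSyncReplicaCount is below min.insync.replicas=$minIsr")
}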

Next, the records are appended to the segment file.


Here we can see that Kafka writes through an NIO channel (FileChannelImpl): the ByteBuffer is handed to the channel, which at this point only lands the data in memory (the OS page cache), not yet on disk.
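A minimal NIO sketch of that write path (file path and payload are made up; in Kafka the equivalent happens in FileRecords.append, which writes the MemoryRecords buffer to the segment's channel):

import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// hypothetical segment file; Kafka keeps one FileChannel open per .log segment
val channel = FileChannel.open(
  Paths.get("/tmp/demo-segment.log"),
  StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)

val buffer = ByteBuffer.wrap("some record batch bytes".getBytes("UTF-8"))
while (buffer.hasRemaining)
  channel.write(buffer) // the bytes go into the page cache; durability comes later, at flush time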

iii. Flushing the channel to disk


At this point everything sitting in the channel gets flushed to disk, and only now is the data truly persisted.

One thing to note: the flush of the .log file goes through FileChannelImpl, while the index files are flushed through mmap (memory-mapped files).
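In NIO terms the two flush paths look roughly like this (a sketch with made-up file names and sizes; in Kafka, FileRecords.flush forces the channel and the index classes force their mapped buffer):

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// .log flush: force the FileChannel so the page-cache pages reach the disk
val logChannel = FileChannel.open(Paths.get("/tmp/demo-segment.log"),
  StandardOpenOption.CREATE, StandardOpenOption.WRITE)
logChannel.force(true) // true = also flush the file metadata

// .index flush: the index is a memory-mapped file, so force the MappedByteBuffer instead
val idxChannel = FileChannel.open(Paths.get("/tmp/demo-segment.index"),
  StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)
val mmap = idxChannel.map(FileChannel.MapMode.READ_WRITE, 0, 10 * 1024 * 1024)
mmap.force()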


2. Storage Principles

Kafka organizes messages by topic, and topics are independent of one another. Each topic is split into multiple partitions; each partition relies on the replica mechanism for high availability and stores a subset of the messages. On disk, a partition is made up of multiple segments, as shown in the figure below:

(figure: a topic is divided into partitions, and each partition is stored as a series of segments)

1. Segment structure

Each partition is made up of multiple segments, and each segment consists of three files: .log, .index, and .timeindex.


  • .log stores the messages themselves
  • .index stores the offset index of the messages
  • .timeindex is the time index, keyed by timestamp

Every segment has these three files, and the number in the file name is the segment's base offset: 1958 means the segment starts at offset 1958. This naming makes it very convenient to locate a message in the log files (see the example layout below).
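For example, a partition directory might look like this (hypothetical base offsets, zero-padded the way Kafka names its segment files):

00000000000000000000.log    00000000000000000000.index    00000000000000000000.timeindex
00000000000000001958.log    00000000000000001958.index    00000000000000001958.timeindex

The second segment's base offset is 1958, so it holds the messages from offset 1958 up to (but not including) the base offset of the next segment.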

i. Structure of the .log file

(figure: on-disk layout of the .log file)

ii. Inspecting the .log file

We can dump a .log file with the following command:

kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-log/#{topic}/000000000000000000.log --print-data-log

(figure: sample output of the DumpLogSegments command for the .log file)

iii. Inspecting the .index file
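The index can be dumped with the same DumpLogSegments tool by pointing it at the .index file (same directory as in the previous step; the path is illustrative):

kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-log/#{topic}/000000000000000000.index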

(figure: sample .index content, where each entry pairs a relative offset with a physical position in the .log file)

2. How a given message is located by offset


1. Each .index file covers a range of offsets, so the target offset first determines which index file (and therefore which segment) to look in.
2. Within that index file, the relative offset is used to find the closest preceding index entry; that entry records a position (the physical address in the .log file).
3. In the corresponding .log file, seek to that position and scan forward sequentially until the message with offset == xxx is found (a toy sketch follows the note below).

Note: the index file uses a sparse index, which both speeds up message lookups and keeps the storage overhead in check.
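To make the lookup concrete, here is a toy sketch with made-up index entries (the real index binary-searches over memory-mapped entries; a linear scan is used here only for brevity):

object OffsetLookupSketch extends App {
  val baseOffset = 1958L // taken from the segment file name
  // hypothetical sparse index entries: (relative offset, physical position in the .log file)
  val index: Vector[(Int, Int)] = Vector((0, 0), (28, 4321), (56, 8765))

  def startPosition(targetOffset: Long): Int = {
    val rel = (targetOffset - baseOffset).toInt
    // find the last entry whose relative offset <= rel (the real code binary-searches)
    index.takeWhile(_._1 <= rel).lastOption.map(_._2).getOrElse(0)
  }

  // offset 2000 -> relative offset 42 -> entry (28, 4321) -> start scanning the .log file at byte 4321
  println(startPosition(2000L))
}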