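I have been studying Kafka recently. My plan was to get the overall architecture straight before looking at implementation details, so I started by reading through the documentation. I had assumed the Coordinator worked much like the Controller: the brokers elect one Coordinator, which then handles partition assignment for the Consumers. Then I looked at the source and, to my dismay, there was no election to be found: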
KafkaServer.scala
/* start kafka coordinator */
consumerCoordinator = GroupCoordinator.create(config, zkUtils, replicaManager)
consumerCoordinator.startup()
GroupCoordinator.scala
/**
 * Startup logic executed at the same time when the server starts up.
 */
def startup() {
  info("Starting up.")
  heartbeatPurgatory = new DelayedOperationPurgatory[DelayedHeartbeat]("Heartbeat", brokerId)
  joinPurgatory = new DelayedOperationPurgatory[DelayedJoin]("Rebalance", brokerId)
  isActive.set(true)
  info("Startup complete.")
}
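The Coordinator is the Kafka component responsible for consumer load balancing, that is, for deciding which consumer in a group consumes each Partition of the Topics the group subscribes to. For a detailed introduction, see the following article: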
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
Concretely, on the Coordinator side, the Consumer uses the Topic Metadata it obtained earlier to send a GroupCoordinatorRequest to the server. The server handles this request in KafkaApis.scala:
def handleGroupCoordinatorRequest(request: RequestChannel.Request) {
  ... ...
    val partition = coordinator.partitionFor(groupCoordinatorRequest.groupId)
    // get metadata (and create the topic if necessary)
    val offsetsTopicMetadata = getTopicMetadata(Set(GroupCoordinator.GroupMetadataTopicName), request.securityProtocol).head
    val coordinatorEndpoint = offsetsTopicMetadata.partitionsMetadata.find(_.partitionId == partition).flatMap {
      partitionMetadata => partitionMetadata.leader
    }
    val responseBody = coordinatorEndpoint match {
      case None =>
        new GroupCoordinatorResponse(Errors.GROUP_COORDINATOR_NOT_AVAILABLE.code, Node.noNode())
      case Some(endpoint) =>
        new GroupCoordinatorResponse(Errors.NONE.code, new Node(endpoint.id, endpoint.host, endpoint.port))
    }
    ... ...
  }
}
def partitionFor(groupId: String): Int = Utils.abs(groupId.hashCode) % groupMetadataTopicPartitionCount
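That makes things clear: the group id is hashed to a Partition of the group metadata topic (GroupCoordinator.GroupMetadataTopicName), and the Coordinator on the broker that hosts that Partition's leader manages load balancing for the Consumer group that sent the request. No election is needed, because every broker starts a Coordinator and the hash decides which one serves a given group.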
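The lookup above can be sketched in a few lines of Java. This is only an illustration, not Kafka's code: the class name CoordinatorPartitionSketch is made up, the abs helper mirrors what Utils.abs is understood to do (map Integer.MIN_VALUE to 0 rather than returning a negative number), and the partition count of 50 is an assumption based on the default of offsets.topic.num.partitions.

```java
// Illustrative sketch only; class and method layout are hypothetical.
final class CoordinatorPartitionSketch {

    // Math.abs(Integer.MIN_VALUE) is still negative, so map that case to 0.
    static int abs(int n) {
        return n == Integer.MIN_VALUE ? 0 : Math.abs(n);
    }

    // Same formula as partitionFor in the article:
    // abs(groupId.hashCode) % groupMetadataTopicPartitionCount
    static int partitionFor(String groupId, int partitionCount) {
        return abs(groupId.hashCode()) % partitionCount;
    }

    public static void main(String[] args) {
        int partitionCount = 50; // assumed default of offsets.topic.num.partitions
        int partition = partitionFor("my-group", partitionCount);
        // The broker leading this partition of the group metadata topic
        // acts as the Coordinator for "my-group".
        System.out.println("my-group -> partition " + partition);
    }
}
```

Because the mapping depends only on the group id and the partition count, every broker computes the same answer, which is why any broker can respond to the GroupCoordinatorRequest.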
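Why did I say load-balancing "management" above rather than assignment? Because the final assignment of partitions to consumers is not carried out on the Coordinator side. In step 4, when Consumers join the Coordinator, the first one to join that is still alive becomes the leader of the group, and in step 5 this leader performs the actual assignment: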
AbstractCoordinator.java
private class JoinGroupResponseHandler extends CoordinatorResponseHandler<JoinGroupResponse, ByteBuffer> {
    ... ...
    @Override
    public void handle(JoinGroupResponse joinResponse, RequestFuture<ByteBuffer> future) {
        // process the response
        short errorCode = joinResponse.errorCode();
        if (errorCode == Errors.NONE.code()) {
            ... ...
            if (joinResponse.isLeader()) {
                onJoinLeader(joinResponse).chain(future);
            } else {
                onJoinFollower().chain(future);
            }
        }
        ... ...
    }
}
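Whether joinResponse.isLeader() is true is decided by the Coordinator. The rule described earlier (the first member to join that is still alive is the leader) can be modeled as below. This is a minimal sketch of that rule only: GroupLeaderSketch and its methods are hypothetical names, and the Coordinator's real bookkeeping is not shown in this article.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of "first live member to join is the leader".
final class GroupLeaderSketch {

    // LinkedHashSet keeps members in the order they joined.
    private final Set<String> members = new LinkedHashSet<>();

    void join(String memberId) {
        members.add(memberId);
    }

    // Called when a member leaves or is considered dead.
    void leave(String memberId) {
        members.remove(memberId);
    }

    // The earliest-joined member that is still present, or null for an empty group.
    String leaderId() {
        return members.isEmpty() ? null : members.iterator().next();
    }

    boolean isLeader(String memberId) {
        return memberId.equals(leaderId());
    }

    public static void main(String[] args) {
        GroupLeaderSketch group = new GroupLeaderSketch();
        group.join("C0");
        group.join("C1");
        System.out.println(group.leaderId()); // prints C0
        group.leave("C0");
        System.out.println(group.leaderId()); // prints C1
    }
}
```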
In onJoinLeader:
private RequestFuture<ByteBuffer> onJoinLeader(JoinGroupResponse joinResponse) {
    try {
        // perform the leader synchronization and send back the assignment for the group
        Map<String, ByteBuffer> groupAssignment = performAssignment(joinResponse.leaderId(), joinResponse.groupProtocol(),
        ... ...
    } catch (RuntimeException e) {
        return RequestFuture.failure(e);
    }
}
protected Map<String, ByteBuffer> performAssignment(String leaderId, String assignmentStrategy, Map<String, ByteBuffer> allSubscriptions) {
    PartitionAssignor assignor = lookupAssignor(assignmentStrategy);
    ... ...
    Map<String, Assignment> assignment = assignor.assign(metadata.fetch(), subscriptions);
    ... ...
}
The assignor implementation can be chosen on the consumer through the partition.assignment.strategy setting. The default is RangeAssignor, whose strategy works as follows:
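Suppose there are two Consumers, C0 and C1, and two Topics, T0 and T1, each with 3 Partitions. The Partition list is then t0p0, t0p1, t0p2, t1p0, t1p1, t1p2. RangeAssignor divides each Topic's Partitions separately: 3 Partitions over 2 Consumers leaves a remainder of 1, so the first Consumer gets one extra Partition from each Topic. The resulting assignment is: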
C0: [t0p0, t0p1, t1p0, t1p1]
C1: [t0p2, t1p2]
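The range strategy in this example can be reproduced with a short sketch. It is not the RangeAssignor source: the class RangeAssignmentSketch is hypothetical, partitions are named with plain strings such as t0p0, and it assumes every Consumer subscribes to every Topic, which is the situation in the example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative range-style assignment; assumes all consumers subscribe to all topics.
final class RangeAssignmentSketch {

    static Map<String, List<String>> assign(List<String> consumers, Map<String, Integer> partitionsPerTopic) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("at least one consumer is required");
        }
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted); // consumers are ordered by id

        Map<String, List<String>> result = new TreeMap<>();
        for (String consumer : sorted) {
            result.put(consumer, new ArrayList<>());
        }

        // Each topic is divided on its own, in topic-name order.
        for (Map.Entry<String, Integer> topic : new TreeMap<>(partitionsPerTopic).entrySet()) {
            int partitions = topic.getValue();
            int perConsumer = partitions / sorted.size();
            int extra = partitions % sorted.size(); // the first 'extra' consumers get one more

            for (int i = 0; i < sorted.size(); i++) {
                int start = perConsumer * i + Math.min(i, extra);
                int length = perConsumer + (i < extra ? 1 : 0);
                for (int p = start; p < start + length; p++) {
                    result.get(sorted.get(i)).add(topic.getKey() + "p" + p);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = new HashMap<>();
        topics.put("t0", 3);
        topics.put("t1", 3);
        List<String> consumers = new ArrayList<>();
        consumers.add("C0");
        consumers.add("C1");
        // prints {C0=[t0p0, t0p1, t1p0, t1p1], C1=[t0p2, t1p2]}
        System.out.println(assign(consumers, topics));
    }
}
```

Because the remainder is handed out per Topic, the first Consumer ends up with one more Partition for every Topic whose Partition count does not divide evenly, so the imbalance grows with the number of Topics.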
Follower Consumers also send a SyncGroupRequest to obtain this assignment. Until the leader has submitted the assignment, the Coordinator keeps the followers' requests blocked.
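The blocking behaviour can be pictured as a small barrier: follower requests are parked until the leader's assignment arrives, then all of them are answered at once. The sketch below models that idea with CompletableFuture. It is an assumption-laden illustration, not the Coordinator's implementation: SyncGroupBarrierSketch, syncFollower and syncLeader are hypothetical names, and assignments are plain strings instead of ByteBuffers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the SyncGroup barrier described above.
final class SyncGroupBarrierSketch {

    // Follower requests waiting for the leader's assignment, keyed by member id.
    private final Map<String, CompletableFuture<String>> pending = new HashMap<>();
    // Null until the leader has submitted the group assignment.
    private Map<String, String> assignment;

    // A follower asks for its assignment; the future stays incomplete until the leader syncs.
    synchronized CompletableFuture<String> syncFollower(String memberId) {
        if (assignment != null) {
            return CompletableFuture.completedFuture(assignment.getOrDefault(memberId, ""));
        }
        return pending.computeIfAbsent(memberId, id -> new CompletableFuture<>());
    }

    // The leader submits the assignment for the whole group, which releases all waiting followers.
    synchronized String syncLeader(String leaderId, Map<String, String> groupAssignment) {
        assignment = new HashMap<>(groupAssignment);
        for (Map.Entry<String, CompletableFuture<String>> waiting : pending.entrySet()) {
            waiting.getValue().complete(assignment.getOrDefault(waiting.getKey(), ""));
        }
        pending.clear();
        return assignment.getOrDefault(leaderId, "");
    }

    public static void main(String[] args) {
        SyncGroupBarrierSketch coordinator = new SyncGroupBarrierSketch();
        CompletableFuture<String> follower = coordinator.syncFollower("C1");
        System.out.println(follower.isDone()); // prints false: the leader has not synced yet

        Map<String, String> plan = new HashMap<>();
        plan.put("C0", "t0p0,t0p1,t1p0,t1p1");
        plan.put("C1", "t0p2,t1p2");
        coordinator.syncLeader("C0", plan);
        System.out.println(follower.join()); // prints t0p2,t1p2
    }
}
```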
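That completes the analysis of how the Coordinator implements load balancing. Once a Consumer has its assignment it starts consuming, as in steps 7 and 8 of the diagram. Whenever any Consumer subscribed to the Topic changes (one joins or leaves), a new round of load balancing is triggered.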