Let's start with a few simple questions. Question 1: given 4 topics with 5 partitions each and a parallelism of 10, how are these topic partitions divided across that parallelism? Question 2: how is dynamic discovery of new topic partitions handled? Question 3: in the setup of question 1, how is the watermark produced?

TL;DR first. When Kafka is the source in Flink, a SourceCoordinator is started first; it contacts Kafka to fetch all topic partitions and keeps watching for newly added tp (topic partitions). It then splits the tp across the parallelism according to a rule (shown with source code below). On each parallel worker, Flink starts a SourceOperator, which sends a registration request to the SourceCoordinator asking for its share of split tp to use for the subsequent consumption of Kafka data. The worker creates a KafkaSourceReader, whose main jobs are creating the KafkaPartitionSplitReader and managing SplitState; the KafkaPartitionSplitReader is the component that actually builds a Kafka consumer and reads the data.

Let's start with the rule, see the code below. Clearly, all partitions of the same topic share one startIndex, but each partition is then offset from that index and lands on a different reader.
That answers question 1: the 5 partitions of one topic are spread over the 10 parallel workers. If that topic's startIndex is 0, its partitions go to workers 0, 1, 2, 3 and 4 respectively (see the small demo after the snippet). So no single KafkaConsumer consumes all 5 partitions of one topic, but one KafkaConsumer may well consume partitions of different topics, because startIndex varies per topic and the resulting owners can overlap on the same worker. Only when the parallelism is smaller than 5 does one KafkaConsumer end up with several partitions of the same topic. The premise here is that different workers mean different KafkaConsumers, which certainly holds, since each worker is a separate parallel instance.

static int getSplitOwner(TopicPartition tp, int numReaders) {
        int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % numReaders;

        // here, the assumption is that the id of Kafka partitions are always ascending
        // starting from 0, and therefore can be used directly as the offset clockwise from the
        // start index
        return (startIndex + tp.partition()) % numReaders;
    }
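
To make the arithmetic concrete, here is a small standalone sketch (my own code, not Flink's) that replicates the formula above and prints where the 4 topics x 5 partitions from question 1 land at parallelism 10. The topic names are made up, so the actual owner numbers depend on their hash codes, but for each topic the five owners are always consecutive modulo 10, which is exactly the claim above.

public class SplitOwnerDemo {
    // same formula as getSplitOwner above, without the Kafka TopicPartition dependency
    static int getSplitOwner(String topic, int partition, int numReaders) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numReaders;
        return (startIndex + partition) % numReaders;
    }

    public static void main(String[] args) {
        int numReaders = 10; // the parallelism from question 1
        for (String topic : new String[] {"topic-a", "topic-b", "topic-c", "topic-d"}) {
            for (int partition = 0; partition < 5; partition++) {
                System.out.printf(
                        "%s-%d -> reader %d%n",
                        topic, partition, getSplitOwner(topic, partition, numReaders));
            }
        }
    }
}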

On to question 2: where does the dynamic update for new partitions come from? From the SourceCoordinator, which is the brain here, although the workers still have to cooperate.
The code below shows the parameter partitionDiscoveryIntervalMs: if it is greater than 0, the enumerator periodically contacts Kafka to check for newly added partitions.

public void start() {
        consumer = getKafkaConsumer();
        adminClient = getKafkaAdminClient();
        if (partitionDiscoveryIntervalMs > 0) {
            LOG.info(
                    "Starting the KafkaSourceEnumerator for consumer group {} "
                            + "with partition discovery interval of {} ms.",
                    consumerGroupId,
                    partitionDiscoveryIntervalMs);
            // The key part: when partitionDiscoveryIntervalMs > 0, callAsync schedules a periodic task.
            context.callAsync(
                    // discover the new partitions and build splits for them
                    this::discoverAndInitializePartitionSplit,
                    // then handle the result: assign the new splits
                    this::handlePartitionSplitChanges,
                    0,
                    partitionDiscoveryIntervalMs);
        } else {
            LOG.info(
                    "Starting the KafkaSourceEnumerator for consumer group {} "
                            + "without periodic partition discovery.",
                    consumerGroupId);
            context.callAsync(
                    () -> {
                        try {
                            return discoverAndInitializePartitionSplit();
                        } finally {
                            // Close the admin client early because we won't use it anymore.
                            adminClient.close();
                        }
                    },
                    this::handlePartitionSplitChanges);
        }
    }
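
For reference, this is roughly how that interval is set from user code. A minimal sketch, assuming the KafkaSource builder API of the new connector; broker address, topics and group id are placeholders:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;

public class DiscoveryConfigSketch {
    public static KafkaSource<String> build() {
        return KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")                    // placeholder
                .setTopics("topic-a", "topic-b", "topic-c", "topic-d") // placeholders
                .setGroupId("demo-group")                              // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // > 0 enables the periodic-discovery branch of start() above
                .setProperty("partition.discovery.interval.ms", "30000")
                .build();
    }
}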

Question 3, watermark generation: if a worker holds several partitions, how is its watermark produced? The source code shows that a watermark is generated per partition (per split); the minimum of those per-split watermarks becomes the worker's watermark, which is then broadcast downstream.
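
From the user's side there is nothing Kafka-specific to configure for this. A minimal usage sketch (reusing the hypothetical DiscoveryConfigSketch from the earlier snippet): the WatermarkStrategy is applied per split inside the SourceOperator, and the operator forwards the minimum of the split-local watermarks downstream.

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> stream =
                env.fromSource(
                        DiscoveryConfigSketch.build(),
                        // generated per split; the worker emits the minimum across its splits
                        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5)),
                        "kafka-source");

        stream.print();
        env.execute("watermark-sketch");
    }
}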

Next comes the source-code part, for readers who like reading source. A caveat: I only follow the main line here and won't paste all of the code.

First component: the SourceCoordinator
The SourceCoordinator runs on the JobMaster, communicates with the workers, and starts the KafkaSourceEnumerator.

Now a closer look at the key methods of the KafkaSourceEnumerator. It first fetches all newly added topic partitions (tp for short), then splits them according to the rule shown earlier, and finally sends the split assignments via RPC to the registered workers (the number of workers equals the parallelism).

    // Discover and initialize partition splits.
    private PartitionSplitChange discoverAndInitializePartitionSplit() {
        // Make a copy of the partitions to owners
        // fetch all tp changes through the subscriber
        KafkaSubscriber.PartitionChange partitionChange =
                subscriber.getPartitionChanges(
                        adminClient, Collections.unmodifiableSet(discoveredPartitions));
        // then resolve the new partitions and their starting/stopping offsets
        Set<TopicPartition> newPartitions =
                Collections.unmodifiableSet(partitionChange.getNewPartitions());
        OffsetsInitializer.PartitionOffsetsRetriever offsetsRetriever = getOffsetsRetriever();

        Map<TopicPartition, Long> startingOffsets =
                startingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);
        Map<TopicPartition, Long> stoppingOffsets =
                stoppingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);

        Set<KafkaPartitionSplit> partitionSplits = new HashSet<>(newPartitions.size());
        for (TopicPartition tp : newPartitions) {
            Long startingOffset = startingOffsets.get(tp);
            long stoppingOffset =
                    stoppingOffsets.getOrDefault(tp, KafkaPartitionSplit.NO_STOPPING_OFFSET);
            partitionSplits.add(new KafkaPartitionSplit(tp, startingOffset, stoppingOffset));
        }
        discoveredPartitions.addAll(newPartitions);
        return new PartitionSplitChange(partitionSplits, partitionChange.getRemovedPartitions());
    }

    // This method should only be invoked in the coordinator executor thread.
    private void handlePartitionSplitChanges(
            PartitionSplitChange partitionSplitChange, Throwable t) {
        if (t != null) {
            throw new FlinkRuntimeException("Failed to handle partition splits change due to ", t);
        }
        if (partitionDiscoveryIntervalMs < 0) {
            LOG.debug("Partition discovery is disabled.");
            noMoreNewPartitionSplits = true;
        }
        // TODO: Handle removed partitions.
        addPartitionSplitChangeToPendingAssignments(partitionSplitChange.newPartitionSplits);
        assignPendingPartitionSplits(context.registeredReaders().keySet());
    }

    // This method should only be invoked in the coordinator executor thread.
    private void addPartitionSplitChangeToPendingAssignments(
            Collection<KafkaPartitionSplit> newPartitionSplits) {
        int numReaders = context.currentParallelism();
        for (KafkaPartitionSplit split : newPartitionSplits) {
            int ownerReader = getSplitOwner(split.getTopicPartition(), numReaders);
            // assign each split to its owner reader, as computed by getSplitOwner(...)
            pendingPartitionSplitAssignment
                    .computeIfAbsent(ownerReader, r -> new HashSet<>())
                    .add(split);
        }
        LOG.debug(
                "Assigned {} to {} readers of consumer group {}.",
                newPartitionSplits,
                numReaders,
                consumerGroupId);
    }

    // This method should only be invoked in the coordinator executor thread.
    private void assignPendingPartitionSplits(Set<Integer> pendingReaders) {
        Map<Integer, List<KafkaPartitionSplit>> incrementalAssignment = new HashMap<>();

        // Check if there's any pending splits for given readers
        for (int pendingReader : pendingReaders) {
            checkReaderRegistered(pendingReader);

            // Remove pending assignment for the reader
            final Set<KafkaPartitionSplit> pendingAssignmentForReader =
                    pendingPartitionSplitAssignment.remove(pendingReader);

            if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
                // Put pending assignment into incremental assignment
                incrementalAssignment
                        .computeIfAbsent(pendingReader, (ignored) -> new ArrayList<>())
                        .addAll(pendingAssignmentForReader);

                // Make pending partitions as already assigned
                pendingAssignmentForReader.forEach(
                        split -> assignedPartitions.add(split.getTopicPartition()));
            }
        }

        // Assign pending splits to readers
        if (!incrementalAssignment.isEmpty()) {
            LOG.info("Assigning splits to readers {}", incrementalAssignment);
            // The key step: send the assigned splits to the corresponding workers via RPC.
            context.assignSplits(new SplitsAssignment<>(incrementalAssignment));
        }

        // If periodically partition discovery is disabled and the initializing discovery has done,
        // signal NoMoreSplitsEvent to pending readers
        if (noMoreNewPartitionSplits) {
            LOG.debug(
                    "No more KafkaPartitionSplits to assign. Sending NoMoreSplitsEvent to reader {}"
                            + " in consumer group {}.",
                    pendingReaders,
                    consumerGroupId);
            pendingReaders.forEach(context::signalNoMoreSplits);
        }
    }
// Key excerpt from inside the context.assignSplits(...) call:
                   assignment
                            .assignment()
                            .forEach(
                                    (id, splits) -> {
                                        final OperatorCoordinator.SubtaskGateway gateway =
                                                getGatewayAndCheckReady(id);

                                        final AddSplitEvent<SplitT> addSplitEvent;
                                        try {
                                            addSplitEvent =
                                                    new AddSplitEvent<>(splits, splitSerializer);
                                        } catch (IOException e) {
                                            throw new FlinkRuntimeException(
                                                    "Failed to serialize splits.", e);
                                        }

                                        gateway.sendEvent(addSplitEvent);
                                    });
                    return null;
                },
                String.format("Failed to assign splits %s due to ", assignment));
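
As a side note, the startingOffsetInitializer and stoppingOffsetInitializer used in discoverAndInitializePartitionSplit above normally come from the source builder. A hedged sketch of the stock choices (method names as in the connector's OffsetsInitializer; the timestamp value is a placeholder):

import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

public class OffsetInitializerChoices {
    // start each newly discovered partition from its earliest available offset
    static final OffsetsInitializer FROM_EARLIEST = OffsetsInitializer.earliest();
    // start from the first record with timestamp >= the given epoch millis (placeholder value)
    static final OffsetsInitializer FROM_TIMESTAMP = OffsetsInitializer.timestamp(1_600_000_000_000L);
    // with no stopping initializer configured, splits keep KafkaPartitionSplit.NO_STOPPING_OFFSET
}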

That's it for today; next time I'll write up the worker side.