The consumer is a key element of Kafka, and its common operations are not complicated. Boiled down, there are really only two things: 1) poll the data out, and 2) mark your position (commit the offset). Kafka's official Java API docs provide several consumer usage examples; let's walk through them one by one and see what kinds of operations there are.
Automatic Offset Committing
This example demonstrates a simple usage of Kafka's consumer API that relies on automatic offset committing.
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }
}
Setting enable.auto.commit means that offsets are committed automatically, with a frequency controlled by the config auto.commit.interval.ms.
The connection to the cluster is bootstrapped by specifying a list of one or more brokers to contact using the configuration bootstrap.servers. This list is just used to discover the rest of the brokers in the cluster and need not be an exhaustive list of servers in the cluster (though you may want to specify more than one in case there are servers down when the client is connecting).
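For example, listing a couple of extra brokers costs nothing and protects against one of them being down when the client starts. A minimal sketch, reusing the props object from the example above (the host names are placeholders):

// Hypothetical broker addresses; any one reachable broker is enough
// for the client to discover the rest of the cluster.
props.put("bootstrap.servers", "broker1.example.com:9092,broker2.example.com:9092");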
In this example the client is subscribing to the topics foo and bar as part of a group of consumers called test, as described above.
The broker will automatically detect failed processes in the test group by using a heartbeat mechanism. The consumer will automatically ping the cluster periodically, which lets the cluster know that it is alive. Note that the consumer is single-threaded, so periodic heartbeats can only be sent when poll(long) is called. As long as the consumer is able to do this it is considered alive and retains the right to consume from the partitions assigned to it. If it stops heartbeating by failing to call poll(long) for a period of time longer than session.timeout.ms, then it will be considered dead and its partitions will be assigned to another process.

(A literal translation of this passage reads awkwardly, so here is my own interpretation.) Suppose the topics foo and bar each have 3 partitions, and we start the code above in 3 separate processes, so the test group now contains 3 consumers. In general each of these consumers will be assigned one partition of foo and one of bar. Each consumer periodically performs a poll (which implicitly sends a heartbeat telling the cluster "I am alive"), and as long as it keeps doing so it retains its right to consume the partitions assigned to it. If one consumer fails, i.e. stops calling poll, the cluster will, after a period of time (session.timeout.ms), reassign its partitions to the other consumers.
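If you want to actually watch these reassignments happen, you can pass a ConsumerRebalanceListener when subscribing. A minimal sketch (my addition, not part of the official example), reusing the consumer from the code above:

import java.util.Arrays;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

// Sketch: log partition movements during rebalances, using the standard
// subscribe(Collection, ConsumerRebalanceListener) overload.
consumer.subscribe(Arrays.asList("foo", "bar"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before a rebalance takes partitions away from this consumer.
        System.out.println("Revoked: " + partitions);
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after a rebalance hands partitions to this consumer.
        System.out.println("Assigned: " + partitions);
    }
});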
The deserializer settings specify how to turn bytes into objects. For example, by specifying string deserializers, we are saying that our record's key and value will just be simple strings.
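If plain strings are not enough, you can plug in your own implementation of the Deserializer interface and register it by its fully qualified class name, just like the built-in ones. A minimal sketch (the class name here is made up for illustration):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;

// Hypothetical example: decode the raw bytes as UTF-8, then upper-case the result.
public class UpperCaseDeserializer implements Deserializer<String> {
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { /* no configuration needed */ }
    @Override
    public String deserialize(String topic, byte[] data) {
        return data == null ? null : new String(data, StandardCharsets.UTF_8).toUpperCase();
    }
    @Override
    public void close() { /* nothing to release */ }
}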
Manual Offset Control
Instead of relying on the consumer to periodically commit consumed offsets, users can also control when messages should be considered as consumed and hence commit their offsets.
This is useful when the consumption of the messages is coupled with some processing logic, and hence a message should not be considered as consumed until it has completed processing.
In this example we will consume a batch of records and batch them up in memory; when we have sufficient records batched we will insert them into a database. If we allowed offsets to auto commit as in the previous example, messages would be considered consumed after they were given out by the consumer, and it would be possible that our process could fail after we have read messages into our in-memory buffer but before they had been inserted into the database. To avoid this we will manually commit the offsets only once the corresponding messages have been inserted into the database. This gives us exact control of when a message is considered consumed.

In short: in this example we want to consume at least 200 messages at a time, insert them into the database, and only then commit the offsets. With the auto-commit approach above, messages could be marked as consumed even though the database insert failed. You can think of this as a simple transaction wrapper.
This raises the opposite possibility: the process could fail in the interval after the insert into the database but before the commit (even though this would likely just be a few milliseconds, it is a possibility). In this case the process that took over consumption would consume from the last committed offset and would repeat the insert of the last batch of data. In other words, if something goes wrong after the insert succeeds but before the offset commit completes (or the commit itself fails), some messages may be consumed twice.
Used in this way Kafka provides what is often called "at-least-once delivery" guarantees, as each message will likely be delivered one time but in failure cases could be duplicated. (Personally I find this phrasing convoluted; put simply, with this approach messages will not be lost, but they may be consumed more than once.)
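The ordering of "process" versus "commit" is what decides the guarantee. A sketch of the contrast, where process(...) is a hypothetical placeholder for your own logic:

// At-most-once (my illustration, not from the docs): commit BEFORE processing.
// A crash between the two steps loses the batch, but never duplicates it.
ConsumerRecords<String, String> records = consumer.poll(100);
consumer.commitSync();
process(records);   // hypothetical processing step

// At-least-once (the pattern used below): process BEFORE committing.
// A crash between the two steps re-delivers the batch, so duplicates are possible.
ConsumerRecords<String, String> records2 = consumer.poll(100);
process(records2);  // hypothetical processing step
consumer.commitSync();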
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "false");     // auto-commit off: we commit manually below
props.put("auto.commit.interval.ms", "1000"); // ignored while auto-commit is disabled
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() >= minBatchSize) {
        insertIntoDb(buffer);  // placeholder for your own persistence logic
        consumer.commitSync(); // commit only after the batch is safely stored
        buffer.clear();
    }
}
The above example uses commitSync to mark all received messages as committed. In some cases you may wish to have even finer control over which messages have been committed, by specifying an offset explicitly. In the example below we commit the offset after we finish handling the messages in each partition.
import java.util.Collections;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

try {
    while (running) { // 'running' is a flag you flip from another thread to shut down
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            // Commit the offset of the NEXT message to read, hence the +1.
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
Note: The committed offset should always be the offset of the next message that your application will read. Thus, when calling commitSync(offsets) you should add one to the offset of the last message processed.
Manual Partition Assignment
In the previous examples, we subscribed to the topics we were interested in and let Kafka dynamically assign a fair share of the partitions for those topics based on the active consumers in the group. However, in some cases you may need finer control over the specific partitions that are assigned. For example:
- If the process is maintaining some kind of local state associated with that partition (like a local on-disk key-value store), then it should only get records for the partition it is maintaining on disk.
- If the process itself is highly available and will be restarted if it fails (perhaps using a cluster management framework like YARN, Mesos, or AWS facilities, or as part of a stream processing framework). In this case there is no need for Kafka to detect the failure and reassign the partition, since the consuming process will be restarted on another machine.
To use this mode, instead of subscribing to the topic using subscribe, you just call assign(Collection) with the full list of partitions that you want to consume.
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
Once assigned, you can call poll in a loop, just as in the preceding examples, to consume records. The group that the consumer specifies is still used for committing offsets, but now the set of partitions will only change with another call to assign. Manual partition assignment does not use group coordination, so consumer failures will not cause assigned partitions to be rebalanced. Each consumer acts independently even if it shares a groupId with another consumer. To avoid offset commit conflicts, you should usually ensure that the groupId is unique for each consumer instance.
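Since there is no group coordination in this mode, you will often also position the consumer yourself, for example when offsets are stored outside Kafka. A minimal sketch using the standard seek API, continuing from the assign call above (the starting offset here is just an illustration):

// Sketch: after assign(), jump a partition to an explicitly chosen offset
// (e.g. one recovered from your own storage) instead of the committed one.
long storedOffset = 42L; // hypothetical offset recovered from external storage
consumer.seek(partition0, storedOffset);
// From here on, poll() returns records starting at 'storedOffset' for partition0.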
Note that it isn't possible to mix manual partition assignment (i.e. using assign) with dynamic partition assignment through topic subscription (i.e. using subscribe).