Kafka guarantees message ordering at the partition level.
Kafka supports three acks levels: 0, 1, and all.
As long as at least one replica stays alive, committed messages are not lost.
Consumers can only read messages that have been committed.
Reliability is traded off against throughput, latency, and hardware cost.
replication
Every Kafka topic is split into partitions, which are the basic data building blocks. A partition is stored on a single disk. Kafka guarantees the order of messages within a partition. A partition can be either online (available) or offline (unavailable). Each partition can have multiple replicas, one of which is designated the leader. All events are produced to and consumed from the leader; the other replicas only need to stay in sync with the leader and replicate all recent events on time. If the leader goes offline, one of the in-sync replicas becomes the new leader.
A follower is considered in-sync if: it has an active session with ZooKeeper, meaning it sent a heartbeat to ZooKeeper in the last 6 seconds (configurable); it fetched messages from the leader in the last 10 seconds (configurable); and the messages it fetched in the last 10 seconds include the leader's most recent messages, meaning it has essentially no lag. If a follower loses its ZooKeeper connection, stops fetching new messages, or falls more than 10 seconds behind, it is considered out-of-sync, and to rejoin it has to fix all of the above. This usually happens after temporary network congestion; if the broker itself went down, recovery takes much longer. A badly tuned Java garbage collector can also cause a broker to pause for a few seconds and flip out of sync.
An in-sync replica that lags slightly on its fetches can slow down producers and consumers, because they have to wait for all in-sync replicas to receive a message before it is considered committed. A smaller replication factor makes Kafka more efficient, but it also increases the probability of data loss when brokers fail.
broker configuration
replication factor (topic and broker level)
The commonly used value is 3. A replication factor of N lets you lose N-1 brokers and still read the topic's data reliably, so a higher factor gives a more reliable system. At the same time, a factor of N requires at least N brokers and stores N copies of the data, so you are trading hardware for reliability.
In general a factor of 3 is good enough, but banks and similarly critical systems often use 5. With a factor of 2, losing one broker can push the cluster into an unstable state (usually with older Kafka versions) and force you to restart the other broker, the Kafka controller, which means being forced into unavailability in order to recover from an operational issue.
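As a rough sketch (the topic name, partition count, and broker address are placeholders, not anything prescribed by the text above), creating a topic with replication factor 3 via the Java AdminClient could look like this:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateReliableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: the topic survives the loss of up to 2 brokers
            NewTopic topic = new NewTopic("important-events", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```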
unclean leader election (broker level)
default: true (note: since Kafka 0.11.0 the broker default is false)
clean: clean leader election guarantees that no committed data is lost; by definition, committed data exists on all in-sync replicas.
There are two scenarios where this matters:
1: With a replication factor of 3, the two followers go down, so the leader is the only in-sync replica. Everything written to the leader is immediately considered committed. If the leader then goes down and one of the out-of-sync followers starts first, an out-of-sync replica becomes the only available replica for the partition.
2: With a replication factor of 3, network issues cause the two followers to fall behind. Even though they keep replicating and trying to catch up, they are no longer in sync, so the leader is again the only in-sync replica. If the leader becomes unavailable, the two followers can never replicate the latest messages and can never become in-sync.
At that point there are two choices:
1: Allow only in-sync replicas to become the new leader, which means waiting until the failed leader comes back online.
2: Allow an out-of-sync replica to become the new leader, and lose whatever was written to the old leader but not yet replicated. After the new leader is elected, if the old leader comes back, it will truncate those unreplicated messages while replicating from the new leader, and that data is gone.
Systems with strict correctness requirements, such as banks, usually set this to false; real-time clickstream analysis pipelines usually set it to true.
minimum in-sync replicas (min.insync.replicas = N)
This guarantees that committed data is written to at least N replicas. When fewer than N replicas are in sync, the topic-partition stops accepting produce requests and the producer receives a NotEnoughReplicasException.
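A minimal sketch of the two settings above as per-topic overrides (the topic name is again a placeholder; the same keys can also be set as broker-wide defaults in server.properties):

```java
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;

public class ReliableTopicConfig {
    static NewTopic reliableTopic() {
        Map<String, String> configs = Map.of(
                "unclean.leader.election.enable", "false", // never promote an out-of-sync replica
                "min.insync.replicas", "2");               // with acks=all, writes need 2 in-sync replicas
        // pass the result to AdminClient.createTopics(...) as in the earlier sketch
        return new NewTopic("important-events", 6, (short) 3).configs(configs);
    }
}
```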
Using producers in a reliable system
Consider the following two scenarios:
1: Replication factor 3 and unclean leader election disabled, so the Kafka cluster will not lose committed messages. However, we set acks=1. A message is sent to the leader, the leader acknowledges the write before the followers have replicated it, and then the leader crashes before replication happens. The followers never received messages that, from the producer's point of view, were written successfully. The result: those messages were never committed, consumers will never see them, yet the producer believes the write succeeded, so the data is silently lost.
2: Replication factor 3 and unclean leader election disabled, and this time we set acks=all. While a new leader is being elected after the old leader crashed, every send returns a "leader not available" response. If the producer does not handle this response and retry until the send succeeds, the message is lost.
To avoid both situations, configure acks appropriately and handle error responses correctly.
acks=0: the message is considered written to Kafka as soon as the producer sends it over the network. You still get an error if the object cannot be serialized or the network card fails, but not if, for example, the partition is in the middle of a leader election, so some data can be lost. This mode is very fast and gives amazing throughput, saturating your bandwidth, but at the price of possibly losing some messages.
acks=1: the message is considered written to Kafka once the leader has written it. If a leader election is in progress, the producer gets a LeaderNotAvailableException; handled properly (by retrying), this avoids losing data in that case. However, if the leader crashes after acknowledging the write but before the followers replicated it, the unreplicated data is still lost.
acks=all: the safest option and also the slowest. You can recover much of the throughput by sending asynchronously and increasing the producer batch size.
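A sketch of a producer configured for the acks=all case (the broker address, topic, batch size, and linger values are illustrative assumptions, not recommended values):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");       // wait for all in-sync replicas
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // larger batches to offset the extra latency
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);     // give batches time to fill

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; the callback reports success or failure per record
            producer.send(new ProducerRecord<>("important-events", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // see the error handling discussion below
                        }
                    });
        }
    }
}
```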
producer retries
Errors come in two classes, retriable and nonretriable; for example LeaderNotAvailable is retriable, while an invalid configuration is not. You can either let the producer retry, or write failed messages somewhere else and handle them manually later.
A sensible retry policy only guarantees at-least-once delivery, not exactly-once, so it is best to add a unique identifier to each message so duplicates can be detected and removed, or to make processing idempotent so that duplicate messages do not affect the system.
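One way to do that, sketched here with a hypothetical header name (record headers are available from Kafka 0.11 on), is to stamp each record with a UUID that downstream consumers can use for de-duplication:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UniqueEventIds {
    // Attach a unique identifier so consumers can drop duplicates created by retries.
    static ProducerRecord<String, String> withEventId(String topic, String key, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        record.headers().add("event-id",
                UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```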
Other errors the producer will surface include: nonretriable broker errors, such as those regarding message size, authorization errors, etc.
Errors that occur before the message was sent to the broker, for example serialization errors.
Errors that occur when the producer exhausts all retry attempts, or when the memory available to the producer fills up because it is being used to store messages while retrying.
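A sketch of a producer callback that separates the two error classes; what "park for manual handling" means is application-specific and left as comments:

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.RetriableException;

public class ProducerErrorCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            return; // the record was acknowledged by the broker
        }
        if (exception instanceof RetriableException) {
            // e.g. NotEnoughReplicasException: the producer already retried up to its
            // configured limit, so alert, escalate, or park the record for a later attempt
        } else {
            // e.g. RecordTooLargeException or authorization failures: retrying will not help;
            // write the record somewhere for manual handling instead
        }
    }
}
```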
Using consumers in a reliable system
As described above, only committed messages can be read by consumers, and committed data is essentially never lost, so all a consumer has to get right is tracking which messages it has already read, i.e. its offsets.
A consumer fetches a batch of data, checks the last offset of that batch, and then fetches the next batch starting from that offset.
When a consumer stops, another consumer needs to know where to pick up the work, so consumers must commit their offsets. The risk is committing offsets for messages that were read but not yet fully processed: if the consumer then stops, those messages are effectively lost.
group.id: two consumers with the same group.id each read a subset of the partitions of a topic, and together the group reads all of the topic's messages.
auto.offset.reset: controls the consumer's behavior when no offset was committed or when it requests an offset that does not exist on the broker. There are two options. earliest: the consumer starts from the beginning of the partition whenever it does not have a valid offset; this can mean reprocessing a lot of data, but it minimizes data loss. latest: the consumer starts from the end of the partition; this minimizes duplicate processing, but it can cause messages to be missed.
enable.auto.commit: automatic commits versus committing offsets manually in code. If all record processing happens inside the consumer's poll loop, automatic offset commits guarantee you never commit an offset for a record you have not processed. The downsides are that you cannot control the amount of duplicate processing (if the consumer stops after processing only part of a batch), and that if records are handed off to another thread, auto-commit may commit offsets for records the consumer has read but not yet processed.
auto.commit.interval.ms: defaults to 5 seconds. Committing more frequently reduces the amount of duplicate processing after a crash, but adds overhead.
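A minimal sketch of a consumer configured along those lines (the group id and broker address are placeholders; earliest plus manual commits reflect the trade-offs discussed above, not the only valid choice):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class ReliableConsumerConfig {
    static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");  // consumers sharing this id split the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");   // prefer duplicates over missed events
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");     // commit manually, only after processing
        return new KafkaConsumer<>(props);
    }
}
```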
ALWAYS COMMIT OFFSETS AFTER EVENTS WERE PROCESSED
If processing record #30 failed but record #31 was processed successfully, you must not commit the offset of #31.
When you hit a retriable error, commit the offset of the last record you processed successfully, put the records that still need processing into a buffer, and keep retrying them. You can call pause() so that subsequent polls do not return more data, which makes retrying easier.
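A rough sketch of that loop, assuming a placeholder tryProcess() that returns false on a retriable failure; offset handling is simplified, since commitSync() here only runs once the retry buffer has drained:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

public class PauseAndRetryLoop {
    private final Deque<ConsumerRecord<String, String>> retryBuffer = new ArrayDeque<>();

    void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                if (!tryProcess(record)) {
                    retryBuffer.add(record);            // keep the failed record for later attempts
                }
            }
            if (retryBuffer.isEmpty()) {
                consumer.resume(consumer.assignment()); // no-op if nothing was paused
                consumer.commitSync();                  // everything polled so far was processed
            } else {
                consumer.pause(consumer.assignment());  // keep polling, but fetch no new records
                retryBuffer.removeIf(this::tryProcess); // re-attempt the buffered records
            }
        }
    }

    private boolean tryProcess(ConsumerRecord<String, String> record) {
        // placeholder for the real processing logic; return false on a retriable failure
        return true;
    }
}
```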
Another option for retriable errors is to write the failing record to a separate retry topic that a dedicated consumer group handles, giving you two channels: the main data flow and a retry flow.
If you need to keep the state of your processing results, such as a running average, you can store that state (e.g. the latest average) in another topic.
To get exactly-once results, add a unique key to each record and write the results to a system that supports idempotent writes by key, so rewriting the same record simply overwrites the earlier copy.
Sometimes processing records takes a long time. Maybe you are interacting with a service that can block or doing a very complex calculation, for example. Remember that in some versions of the Kafka consumer, you can’t stop polling for more than a few seconds (see Chapter 4 for details). Even if you don’t want to process additional records, you must continue polling so the client can send heartbeats to the broker. A common pattern in these cases is to hand off the data to a thread-pool when possible with multiple threads to speed things up a bit by processing in parallel. After handing off the records to the worker threads, you can pause the consumer and keep polling without actually fetching additional data until the worker threads finish. Once they are done, you can resume the consumer. Because the consumer never stops polling, the heartbeat will be sent as planned and rebalancing will not be triggered.
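A simplified sketch of that pattern, assuming a placeholder processBatch() and handing each polled batch to a single worker task for brevity (a real implementation might spread the records across several threads):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HandOffProcessingLoop {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            if (records.isEmpty()) {
                continue;
            }
            // Hand the batch to a worker thread, then pause so that further poll() calls
            // keep the consumer alive in the group without fetching more data.
            Future<?> batch = workers.submit(() -> processBatch(records));
            consumer.pause(consumer.assignment());
            while (!batch.isDone()) {
                consumer.poll(Duration.ofMillis(100));
            }
            consumer.resume(consumer.assignment());
            consumer.commitSync(); // commit only after the whole batch was processed
        }
    }

    private void processBatch(ConsumerRecords<String, String> records) {
        // placeholder for slow, possibly blocking, per-record processing
    }
}
```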
Validating Configuration
It is easy to test the broker and client configuration in isolation from the application logic, and it is recommended to do so for two reasons:
- It helps to test if the configuration you’ve chosen can meet your requirements.
- It is good exercise to reason through the expected behavior of the system. This chapter was a bit theoretical, so checking your understanding of how the theory applies in practice is important.
Kafka includes two important tools to help with this validation. The org.apache.kafka.tools package includes VerifiableProducer and VerifiableConsumer classes. These can run as command-line tools, or be embedded in an automated testing framework.
The idea is that the verifiable producer produces a sequence of messages containing numbers from 1 to a value you choose. You can configure it the same way you configure your own producer, setting the right number of acks, retries, and the rate at which the messages will be produced. When you run it, it will print success or error for each message sent to the broker, based on the acks received. The verifiable consumer performs the complementary check. It consumes events (usually those produced by the verifiable producer) and prints out the events it consumed in order. It also prints information regarding commits and rebalances.
You should also consider which tests you want to run. For example:
- Leader election: what happens if I kill the leader? How long does it take the producer and consumer to start working as usual again?
- Controller election: how long does it take the system to resume after a restart of the controller?
- Rolling restart: can I restart the brokers one by one without losing any messages?
- Unclean leader election test: what happens when we kill all the replicas for a partition one by one (to make sure each goes out of sync) and then start a broker that was out of sync? What needs to happen in order to resume operations? Is this acceptable?
Then you pick a scenario, start the verifiable producer, start the verifiable consumer, and run through the scenario—for example, kill the leader of the partition you are producing data into. If you expected a short pause and then everything to resume normally with no message loss, make sure the number of messages produced by the producer and the number of messages consumed by the consumer match.
The Apache Kafka source repository includes an extensive test suite. Many of the tests in the suite are based on the same principle—use the verifiable producer and consumer to make sure rolling upgrades work, for example.
Validating Applications
Once you are sure your broker and client configuration meet your requirements, it is time to test whether your application provides the guarantees you need. This will check things like your custom error-handling code, offset commits, and rebalance listeners and similar places where your application logic interacts with Kafka’s client libraries.
Naturally, because it is your application, there is only so much guidance we can provide on how to test it. Hopefully you have integration tests for your application as part of your development process. However you validate your application, we recommend running tests under a variety of failure conditions:
- Clients lose connectivity to the server (your system administrator can assist you in simulating network failures)
- Leader election
- Rolling restart of brokers
- Rolling restart of consumers
- Rolling restart of producers
For each scenario, you will have expected behavior, which is what you planned on seeing when you developed your application, and then you can run the test to see what actually happens. For example, when planning for a rolling restart of consumers, you may plan for a short pause as consumers rebalance and then continue consumption with no more than 1,000 duplicate values. Your test will show whether the way the application commits offsets and handles rebalances actually works this way.
Monitoring Reliability in Production
Testing the application is important, but it does not replace the need to continuously monitor your production systems to make sure data is flowing as expected. Chapter 9 will cover detailed suggestions on how to monitor the Kafka cluster, but in addition to monitoring the health of the cluster, it is important to also monitor the clients and the flow of data through the system.
First, Kafka’s Java clients include JMX metrics that allow monitoring client-side status and events. For the producers, the two metrics most important for reliability are error-rate and retry-rate per record (aggregated). Keep an eye on those, since error or retry rates going up can indicate an issue with the system. Also monitor the producer logs for errors that occur while sending events that are logged at WARN level, and say something along the lines of “Got error produce response with correlation id 5689 on topic-partition [topic-1,3], retrying (two attempts left). Error: …”. If you see events with 0 attempts left, the producer is running out of retries. Based on the discussion in the section “Using Producers in a Reliable System”, you may want to increase the number of retries, or solve the problem that caused the errors in the first place.
On the consumer side, the most important metric is consumer lag. This metric indicates how far the consumer is from the latest message committed to the partition on the broker. Ideally, the lag would always be zero and the consumer will always read the latest message. In practice, because calling poll() returns multiple messages and then the consumer spends time processing them before fetching more messages, the lag will always fluctuate a bit. What is important is to make sure consumers do eventually catch up rather than fall farther and farther behind. Because of the expected fluctuation in consumer lag, setting traditional alerts on the metric can be challenging. Burrow is a consumer lag checker by LinkedIn and can make this easier.
Monitoring the flow of data also means making sure all produced data is consumed in a timely manner (your requirements will dictate what “timely manner” means). In order to make sure data is consumed in a timely manner, you need to know when the data was produced. Kafka assists in this: starting with version 0.10.0, all messages include a timestamp that indicates when the event was produced. If you are running clients with an earlier version, we recommend recording the timestamp, name of the app producing the message, and hostname where the message was created, for each event. This will help track down sources of issues later on.
In order to make sure all produced messages are consumed within a reasonable amount of time, you will need the application producing the data to record the number of events produced (usually as events per second). The consumers need to record both the number of events consumed (also events per second) and the lag from the time events were produced to the time they were consumed, using the event timestamp. Then you will need a system to reconcile the events-per-second numbers from both the producer and the consumer (to make sure no messages were lost on the way) and to make sure events were consumed within a reasonable amount of time after they were produced. For even better monitoring, you can add a monitoring consumer on critical topics that will count events and compare them to the events produced, so you will get accurate monitoring of producers even if no one is consuming the events at a given point in time. These types of end-to-end monitoring systems can be challenging and time-consuming to implement. To the best of our knowledge, there is no open source implementation of this type of system, but Confluent provides a commercial implementation as part of the Confluent Control Center.
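For the lag-from-production measurement, a tiny sketch using the broker-assigned record timestamp available since 0.10.0; how the resulting number is reported to your metrics system is left out:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class EndToEndLag {
    // Milliseconds between when the event was produced and when this consumer saw it.
    static long lagMillis(ConsumerRecord<?, ?> record) {
        return System.currentTimeMillis() - record.timestamp();
    }
}
```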