问题解决过程
线上一个界面发现老是没有数据,排查下来时生产者没有成功发送消息所致,报错如下:
org.springframework.kafka.core.KafkaProducerException: Failed to send; nested exception is org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 0 ms.
我们发现配置里有一个很奇怪的参数
config.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 0);
我们找到相关源码:
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
TopicPartition tp = null;
try {
throwIfProducerClosed();
// first make sure the metadata for the topic is available
ClusterAndWaitTime clusterAndWaitTime;
try {
// -------------------------maxBlockTimeMs最终会传入ProducerConfig.MAX_BLOCK_MS_CONFIG对应值0
clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
.......
private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException {
....
do {
log.trace("Requesting metadata update for topic {}.", topic);
metadata.add(topic);
int version = metadata.requestUpdate();
sender.wakeup();
try {
metadata.awaitUpdate(version, remainingWaitMs);
} catch (TimeoutException ex) {
// ------------------主要是在这里报错了------------
// Rethrow with original maxWaitMs to prevent logging exception with remainingWaitMs
throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
}
....
我们找到max.block.ms的作用
public static final String MAX_BLOCK_MS_CONFIG = "max.block.ms";
private static final String MAX_BLOCK_MS_DOC = "The configuration controls how long <code>KafkaProducer.send()</code> and <code>KafkaProducer.partitionsFor()</code> will block."
+ "These methods can be blocked either because the buffer is full or metadata unavailable."
+ "Blocking in the user-supplied serializers or partitioner will not be counted against this timeout.";
问题很清晰了, send时发现要更新metadata,更新metadata时0ms超时报错了!
故把该值改为1000即可
config.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 1000);
深入思考这个问题
这个功能是由界面点击时触发的,但运营人员很少会去使用这个功能,故猜测太长时间没用,比如超过9分钟(connections.max.idle.ms), 导致连接关闭,下次send时需要重新拉取metadata导致,但实际是这样的吗?no
我们把connections.max.idle.ms=1000,期待1s后send失败,但实际上还是需要空闲好几分钟后send才会失败,说明connections.max.idle.ms这个参数不是报错的直接原因!
通过DEBUG日志, 我们发现超时连接断开时,会触发metadata的更新;当metadata.max.age.ms超时时也会触发metadate更新。我们再深入研究下去
调试时发现抛出报错的日志前面,有如下所示日志
[2021-08-25 22:28:15,039] [kafka-producer-network-thread | producer-1] [DEBUG] org.apache.kafka.clients.Metadata - Removing unused topic InnerProcess from the metadata list, expiryMs 1629901694499 now 1629901695039
这句话是在下面这个地方打印的
org.apache.kafka.clients.Metadata#update
public static final long TOPIC_EXPIRY_MS = 5 * 60 * 1000;
public synchronized void update(Cluster newCluster, Set<String> unavailableTopics, long now) {
....
// producer默认为true,consumer默认为false
if (topicExpiryEnabled) {
// Handle expiry of topics from the metadata refresh set.
for (Iterator<Map.Entry<String, Long>> it = topics.entrySet().iterator(); it.hasNext(); ) {
Map.Entry<String, Long> entry = it.next();
long expireMs = entry.getValue();
// 第一次需要立即更新为-1
if (expireMs == TOPIC_EXPIRY_NEEDS_UPDATE)
// 第一次设置了5分钟后失效
entry.setValue(now + TOPIC_EXPIRY_MS);
else if (expireMs <= now) {
// 达到失效时间后,topic就会删除,下次send需要重新更新metadata了
it.remove();
log.debug("Removing unused topic {} from the metadata list, expiryMs {} now {}", entry.getKey(), expireMs, now);
}
}
}
.....
我们可以看到topic其实是设置了失效时间的,默认5分钟,当第一次调用send(),失效时间为-1,并且会触发更新metadata,即会修改为entry.setValue(now + TOPIC_EXPIRY_MS),5分钟后失效,但5分钟后不会立即失效,若等到9分钟连接断开时,会走到上面的代码:
if (expireMs <= now) {
// 达到失效时间后,topic就会删除,下次send需要重新更新metadata了
it.remove();
log.debug("Removing unused topic {} from the metadata list, expiryMs {} now {}", entry.getKey(), expireMs, now);
}
即会将topic移除了,故9分钟后再次send时,需要更新metadata,但max.block.ms=0导致超时报错,即发送失败!
我们可以做这样的试验, 修改下面2个参数
// 在topic失效之前停掉
config.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 4 * 60 * 1000);
// 10小时才更新一次metadata
config.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 10 * 60 * 60 * 1000);
然后每大于5分钟的时间间隔,send一次
@Scheduled(fixedDelay = 9 * 60 * 1000 + 10000)
public void sendKafkaASync() {
kafkaService.sendKafkaASync();
}
发现除了第一次失败后,就不会失败了!!!
因此,试验下来,是由于max.block.ms设置过小,导致更新metadata超时,导致send失败;更深入的:
1)首次发送没有topic数据,send调用时需要首先更新metadata时间不够,导致超时;
2)后面发送失败,是因为topic已超时,并在之后,connections.max.idle.ms或metadata.max.age.ms到时间触发metadata更新,而把超时的topic移除了。topic移除后,再进行send,有需要先更新metadata,进而还是会失败;
总体上而言,send的间隔不要超过5分钟最好,超过5分钟不调用send, 存在一定的可能导致topic被移除,从而下次send需要重新拉去metadata
附录:相关代码
pom.xml
<parent>
<artifactId>spring-boot-starter-parent</artifactId>
<groupId>org.springframework.boot</groupId>
<version>2.1.5.RELEASE</version>
</parent>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<exclusions>
<!-- logback to log4j2 -->
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-log4j2</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.kafka</groupId>
<artifactId>spring-kafka</artifactId>
<version>2.2.5.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.12</version>
</dependency>
</dependencies>
log4j2.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- status指定log4j本身的日志打印级别 -->
<!-- 配置文件发生修改,在monitorInterval时间即5s内,重新加载,无需重启 -->
<Configuration status="DEBUG" monitorInterval="5">
.....
<Loggers>
<Logger name="org.springframework" level="INFO" />
<Logger name="com.ydfind" level="DEBUG" />
<Logger name="org.apache.kafka.clients.consumer" level="WARN"/>
<Logger name="org.apache.kafka.clients.FetchSessionHandler" level="WARN" />
<Logger name="org.apache.kafka.clients.producer.KafkaProducer" level="TRACE" />
</Loggers>
</Configuration>
application.yml
logging:
config: classpath:log4j2.xml
KafkaConfig.java
package com.ydfind.config;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.security.auth.SecurityProtocol;
import org.apache.kafka.common.security.scram.ScramLoginModule;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafka;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.config.KafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.listener.ConcurrentMessageListenerContainer;
import org.springframework.web.context.request.FacesRequestAttributes;
import java.awt.datatransfer.StringSelection;
import java.util.HashMap;
import java.util.Map;
@Configuration
@EnableKafka
public class KafkaConfig {
@Bean
public KafkaTemplate<String, String> kafkaTemplate() {
return new KafkaTemplate<>(producerFactory());
}
private ProducerFactory<String, String> producerFactory() {
return new DefaultKafkaProducerFactory<>(producerConfigs());
}
private Map<String, Object> producerConfigs() {
Map<String, Object> config = new HashMap<>();
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
//
// config.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, SecurityProtocol.SASL_PLAINTEXT.name);
// config.put(SaslConfigs.SASL_JAAS_CONFIG, "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"xxx\" password=\"xxx\"");
// config.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-256");
config.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 0);
// config.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 1 * 60 * 1000);
// config.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 30 * 60 * 1000);
// config.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 100 * 60 * 1000);
// config.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 1 * 60 * 1000);
// ----貌似下面会10小时内都成功!!!!---------------------------
// 4min时connect断掉,5min时topic失效,6min时send()成功,10min时把topic失效时间弄从15min!!!!!
// 是send()导致entry.getValue() == TOPIC_EXPIRY_NEEDS_UPDATE,还是connect断开的更新topic导致呢!!!
// topics.put(topic, TOPIC_EXPIRY_NEEDS_UPDATE)-------------send时会把topic的失效时间弄成-1----相当于续命,不会失效
// 在topic失效之前停掉
config.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 4 * 60 * 1000);
// 10小时才更新一次metadata
config.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 10 * 60 * 60 * 1000);
return config;
}
}
KafkaService
package com.ydfind.service;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.stereotype.Component;
import org.springframework.util.concurrent.ListenableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicInteger;
@Component
@Slf4j
public class KafkaService {
@Autowired
private KafkaTemplate<String, String> kafkaTemplate;
public String sendKafkaSync() {
ListenableFuture<SendResult<String, String>> send = kafkaTemplate.send("InnerProcess", "sendKafkaSync");
String msg = "";
try {
SendResult<String, String> stringStringSendResult = send.get();
msg += stringStringSendResult.getRecordMetadata().offset();
} catch (InterruptedException e) {
e.printStackTrace();
return e.getMessage();
} catch (ExecutionException e) {
e.printStackTrace();
return e.getMessage();
}
return msg;
}
private AtomicInteger total = new AtomicInteger();
private AtomicInteger success = new AtomicInteger();
public void sendKafkaASync() {
ListenableFuture<SendResult<String, String>> send = kafkaTemplate.send("InnerProcess", "sendKafkaSync");
send.completable().whenCompleteAsync((n, e) -> {
total.getAndIncrement();
if (null != e) {
e.printStackTrace();
} else {
success.getAndIncrement();
}
log.info("\n-------------------total = {}, success = {}", total.get(), success.get());
});
}
}
MainApp.java
package com.ydfind;
import com.ydfind.service.KafkaService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
@SpringBootApplication
@RestController
@EnableScheduling
public class MainApp {
public static void main(String[] args) {
SpringApplication.run(MainApp.class, args);
}
@Autowired
private KafkaService kafkaService;
@GetMapping("sendKafka")
public String sendKafka() {
return kafkaService.sendKafkaSync();
}
@Scheduled(fixedDelay = 9 * 60 * 1000 + 10000)
public void sendKafkaASync() {
kafkaService.sendKafkaASync();
}
}