1. The Kafka Connection Timeout Exception

The Kafka version is 2.7.1. An excerpt of the Kafka client's error output:

Exception in thread "main" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=listNodes, deadlineMs=1636366763102, tries=1, nextAllowedTryMs=1636366763208) timed out at 1636366763108 after 1 attempt(s)
    at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
    at com.xxx.demo.kafka.KafkaApi.main(KafkaApi.java:21)
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=listNodes, deadlineMs=1636366763102, tries=1, nextAllowedTryMs=1636366763208) timed out at 1636366763108 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: listNodes

The error message is clear enough: a timeout exception.
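For context, the call that produced it looked roughly like the following. This is a hypothetical reconstruction of KafkaApi.java (only the class name comes from the stack trace; the bootstrap address and everything else are assumptions):

import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class KafkaApi {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Assumed bootstrap address; the real one pointed at the IaaS host's public IP.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-host:9092");

        try (AdminClient client = AdminClient.create(props)) {
            // describeCluster() issues the "listNodes" Call seen in the stack trace;
            // get() blocks on the KafkaFuture and rethrows the TimeoutException.
            System.out.println(client.describeCluster().nodes().get());
        }
    }
}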

 

2. Ruling Out the Usual Suspects

First, my Kafka service runs on an IaaS host, so when a connection times out, the usual first suspect is the firewall: a port that was never opened. I ran telnet from my local machine against the server's IP and port; telnet connected fine, and I double-checked which ports were open on the server, which ruled out the firewall. The deployment itself also looked healthy: producing and consuming messages on the server worked without issues.
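The same TCP-level reachability check can also be scripted from Java instead of telnet; a minimal sketch, with host and port as placeholders:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) {
        // Placeholder address: replace with the broker's public IP and port.
        InetSocketAddress broker = new InetSocketAddress("kafka-host", 9092);
        try (Socket socket = new Socket()) {
            socket.connect(broker, 3000); // 3s connect timeout
            System.out.println("TCP connect OK: " + broker);
        } catch (IOException e) {
            System.out.println("TCP connect FAILED: " + e.getMessage());
        }
    }
}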

 

3. Digging Into the Client Source Code

First, confirm that the server-side Kafka version and the client version match; both are 2.7.1:

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.7.1</version>
        </dependency>


 

When the Kafka client is created, it instantiates a (Kafka) Selector, which internally opens a Java NIO Selector:

static KafkaAdminClient createInternal(AdminClientConfig config, TimeoutProcessorFactory timeoutProcessorFactory) {
    ...
    selector = new Selector(config.getLong(AdminClientConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG),
                    metrics, time, metricGrpPrefix, channelBuilder, logContext);
    ...
}
public Selector(int maxReceiveSize,
            long connectionMaxIdleMs,
            int failedAuthenticationDelayMs,
            Metrics metrics,
            Time time,
            String metricGrpPrefix,
            Map<String, String> metricTags,
            boolean metricsPerConnection,
            boolean recordTimePerConnection,
            ChannelBuilder channelBuilder,
            MemoryPool memoryPool,
            LogContext logContext) {
        try {
            this.nioSelector = java.nio.channels.Selector.open();
        } catch (IOException e) {
            throw new KafkaException(e);
        }
        ...
    }

Once the client instance is created, the client thread starts and loops continuously in processRequests() to handle requests:

@Override
public void run() {
    log.trace("Thread starting");
    try {
        processRequests();
    } finally {
        ...
    }
}

Next, the main program calls the client's describeCluster. Note that the Call it creates is named "listNodes" — exactly the callName in the exception:

@Override
    public DescribeClusterResult describeCluster(DescribeClusterOptions options) {
        final KafkaFutureImpl<Collection<Node>> describeClusterFuture = new KafkaFutureImpl<>();
        final KafkaFutureImpl<Node> controllerFuture = new KafkaFutureImpl<>();
        final KafkaFutureImpl<String> clusterIdFuture = new KafkaFutureImpl<>();
        final KafkaFutureImpl<Set<AclOperation>> authorizedOperationsFuture = new KafkaFutureImpl<>();

        final long now = time.milliseconds();
        runnable.call(new Call("listNodes", calcDeadlineMs(now, options.timeoutMs()),
            new LeastLoadedNodeProvider()) {

            @Override
            MetadataRequest.Builder createRequest(int timeoutMs) {
                // Since this only requests node information, it's safe to pass true for allowAutoTopicCreation (and it
                // simplifies communication with older brokers)
                return new MetadataRequest.Builder(new MetadataRequestData()
                    .setTopics(Collections.emptyList())
                    .setAllowAutoTopicCreation(true)
                    .setIncludeClusterAuthorizedOperations(options.includeAuthorizedOperations()));
            }
            ...
    }
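As an aside, the deadlineMs in the exception comes from calcDeadlineMs(now, options.timeoutMs()) above; the caller can tighten or relax that per-call deadline through DescribeClusterOptions. An illustrative sketch (bootstrap address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterOptions;

public class DescribeWithTimeout {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-host:9092"); // placeholder
        try (AdminClient client = AdminClient.create(props)) {
            // timeoutMs feeds calcDeadlineMs(now, options.timeoutMs()); if no
            // connection is ready when the deadline passes, the Call fails with
            // the TimeoutException from section 1.
            DescribeClusterOptions options = new DescribeClusterOptions().timeoutMs(5000);
            System.out.println(client.describeCluster(options).nodes().get());
        }
    }
}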

The client thread then drains this new Call from pendingCalls and chooses a node to handle it:

// Choose nodes for our pending calls.
pollTimeout = Math.min(pollTimeout, maybeDrainPendingCalls(now));
long metadataFetchDelayMs = metadataManager.metadataFetchDelayMs(now);
if (metadataFetchDelayMs == 0) {
    metadataManager.transitionToUpdatePending(now);
    Call metadataCall = makeMetadataCall(now);
    // Create a new metadata fetch call and add it to the end of pendingCalls.
    // Assign a node for just the new call (we handled the other pending nodes above).
    if (!maybeDrainPendingCall(metadataCall, now))
        pendingCalls.add(metadataCall);
}

The node chosen by leastLoadedNode is the one with the fewest outstanding requests that is at least eligible for a connection; for each candidate it first tests whether it can be sent to or connected:

@Override
public Node leastLoadedNode(long now) {
    List<Node> nodes = this.metadataUpdater.fetchNodes();
    if (nodes.isEmpty())
        throw new IllegalStateException("There are no nodes in the Kafka cluster");
    int inflight = Integer.MAX_VALUE;

    Node foundConnecting = null;
    Node foundCanConnect = null;
    Node foundReady = null;

    int offset = this.randOffset.nextInt(nodes.size());
    for (int i = 0; i < nodes.size(); i++) {
        int idx = (offset + i) % nodes.size();
        Node node = nodes.get(idx);
        if (canSendRequest(node.idString(), now)) {
            int currInflight = this.inFlightRequests.count(node.idString());
            if (currInflight == 0) {
                // if we find an established connection with no in-flight requests we can stop right away
                log.trace("Found least loaded node {} connected with no in-flight requests", node);
                return node;
            } else if (currInflight < inflight) {
                // otherwise if this is the best we have found so far, record that
                inflight = currInflight;
                foundReady = node;
            }
        } else if (connectionStates.isPreparingConnection(node.idString())) {
            foundConnecting = node;
        } else if (canConnect(node, now)) {
            if (foundCanConnect == null ||
                    this.connectionStates.lastConnectAttemptMs(foundCanConnect.idString()) >
                            this.connectionStates.lastConnectAttemptMs(node.idString())) {
                foundCanConnect = node;
            }
        } else {
            log.trace("Removing node {} from least loaded node selection since it is neither ready " +
                    "for sending or connecting", node);
        }
    }
    // Remainder of the method (abridged in the original): prefer a ready node,
    // then a connecting one, then one we can connect to; otherwise return null.
    if (foundReady != null)
        return foundReady;
    else if (foundConnecting != null)
        return foundConnecting;
    else if (foundCanConnect != null)
        return foundCanConnect;
    else
        return null;
}
The Kafka client thread handles the requests issued by the main program. Following the code down from leastLoadedNode, we arrive at sendEligibleCalls:

pollTimeout = Math.min(pollTimeout, sendEligibleCalls(now));

Now comes the "dialing" step: the client walks the per-node lists of queued Calls and sends them one by one.

private long sendEligibleCalls(long now) {
    long pollTimeout = Long.MAX_VALUE;
    for (Iterator<Map.Entry<Node, List<Call>>> iter = callsToSend.entrySet().iterator(); iter.hasNext(); ) {
        Map.Entry<Node, List<Call>> entry = iter.next();
        List<Call> calls = entry.getValue();
        if (calls.isEmpty()) {
            iter.remove();
            continue;
        }
        Node node = entry.getKey();
        if (!client.ready(node, now)) {
            long nodeTimeout = client.pollDelayMs(node, now);
            pollTimeout = Math.min(pollTimeout, nodeTimeout);
            log.trace("Client is not ready to send to {}. Must delay {} ms", node, nodeTimeout);
            continue;
        }
        Call call = calls.remove(0);
        int requestTimeoutMs = Math.min(KafkaAdminClient.this.requestTimeoutMs,
                calcTimeoutMsRemainingAsInt(now, call.deadlineMs));
        AbstractRequest.Builder<?> requestBuilder;
        try {
            requestBuilder = call.createRequest(requestTimeoutMs);
        } catch (Throwable throwable) {
            call.fail(now, new KafkaException(String.format(
                "Internal error sending %s to %s.", call.callName, node)));
            continue;
        }
        ClientRequest clientRequest = client.newClientRequest(node.idString(), requestBuilder, now,
                true, requestTimeoutMs, null);
        log.debug("Sending {} to {}. correlationId={}", requestBuilder, node, clientRequest.correlationId());
        client.send(clientRequest, now);
        getOrCreateListValue(callsInFlight, node.idString()).add(call);
        correlationIdToCalls.put(clientRequest.correlationId(), call);
    }
    return pollTimeout;
}

client.ready checks whether the node is ready to be sent to, initiating a connection if not:

public boolean ready(Node node, long now) {
    if (node.isEmpty())
        throw new IllegalArgumentException("Cannot connect to empty node " + node);

    if (isReady(node, now))
        return true;

    if (connectionStates.canConnect(node.idString(), now))
        // if we are interested in sending to a node and we don't have a connection to it, initiate one
        initiateConnect(node, now);

    return false;
}

Inside isReady(node, now), canSendRequest checks whether the node with the given id is ready to accept more requests:

private boolean canSendRequest(String node, long now) {
    return connectionStates.isReady(node, now) && selector.isChannelReady(node) &&
        inFlightRequests.canSendMore(node);
}

If the node is not ready and connectionStates.canConnect permits an attempt, a connection is initiated:

private void initiateConnect(Node node, long now) {
    String nodeConnectionId = node.idString();
    try {
        connectionStates.connecting(nodeConnectionId, now, node.host(), clientDnsLookup);
        InetAddress address = connectionStates.currentAddress(nodeConnectionId);
        log.debug("Initiating connection to node {} using address {}", node, address);
        selector.connect(nodeConnectionId,
                new InetSocketAddress(address, node.port()),
                this.socketSendBuffer,
                this.socketReceiveBuffer);
    } catch (IOException e) {
        log.warn("Error connecting to node {}", node, e);
        // Attempt failed, we'll try again after the backoff
        connectionStates.disconnected(nodeConnectionId, now);
        // Notify metadata updater of the connection failure
        metadataUpdater.handleServerDisconnect(now, nodeConnectionId, Optional.empty());
    }
}

The node is connected through Kafka's Selector (still NIO's Selector underneath), which registers the new channel with the nioSelector under the given id:

@Override
public void connect(String id, InetSocketAddress address, int sendBufferSize, int receiveBufferSize) throws IOException {
    ensureNotRegistered(id);
    SocketChannel socketChannel = SocketChannel.open();
    SelectionKey key = null;
    try {
        configureSocketChannel(socketChannel, sendBufferSize, receiveBufferSize);
        boolean connected = doConnect(socketChannel, address);
        key = registerChannel(id, socketChannel, SelectionKey.OP_CONNECT);

        if (connected) {
            // OP_CONNECT won't trigger for immediately connected channels
            log.debug("Immediately connected to node {}", id);
            immediatelyConnectedKeys.add(key);
            key.interestOps(0);
        }
    } catch (IOException | RuntimeException e) {
        if (key != null)
            immediatelyConnectedKeys.remove(key);
        channels.remove(id);
        socketChannel.close();
        throw e;
    }
}

Here, however, connecting returned false and the connection never completed, so back in the sending flow this node was skipped on every pass — and it was the only node in the call list.
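A note on the doConnect(...) call above: for a non-blocking SocketChannel, connect() returning false does not by itself mean failure; it means the connection is still in progress and must be completed via finishConnect() once OP_CONNECT fires (finishConnect() is what throws if the remote end is actually unreachable). A minimal standalone sketch of that NIO pattern — not Kafka code, and the address is a placeholder:

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NonBlockingConnect {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);

        // In non-blocking mode connect() usually returns false: "in progress".
        boolean connected = channel.connect(new InetSocketAddress("kafka-host", 9092));
        if (!connected) {
            channel.register(selector, SelectionKey.OP_CONNECT);
            selector.select(3000); // wait up to 3s for OP_CONNECT
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable()) {
                    // finishConnect() throws if the connection actually failed.
                    connected = channel.finishConnect();
                }
            }
        }
        System.out.println("connected = " + connected);
        channel.close();
        selector.close();
    }
}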


Finally, on every pass of its while loop, the Kafka client thread checks each outstanding Call against its deadline, fails any Call that has timed out, and removes it from the pending list.
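The shape of that check is a standard deadline scan inside the poll loop, roughly like this simplified sketch — not the actual KafkaAdminClient source, though Call, callName, deadlineMs and fail(...) mirror the names used above:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class TimeoutPruneSketch {
    // Minimal stand-in for KafkaAdminClient's internal Call, which carries far more state.
    static class Call {
        final String callName;
        final long deadlineMs;
        Call(String callName, long deadlineMs) { this.callName = callName; this.deadlineMs = deadlineMs; }
        void fail(long nowMs, Exception e) {
            System.out.println(callName + " failed at " + nowMs + ": " + e.getMessage());
        }
    }

    // Invoked on every pass of the client thread's loop.
    static void timeoutPendingCalls(long nowMs, List<Call> pendingCalls) {
        Iterator<Call> iter = pendingCalls.iterator();
        while (iter.hasNext()) {
            Call call = iter.next();
            if (nowMs >= call.deadlineMs) {
                // Produces the same shape of error we saw in section 1.
                call.fail(nowMs, new org.apache.kafka.common.errors.TimeoutException(
                        "Timed out waiting to send the call. Call: " + call.callName));
                iter.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<Call> pending = new ArrayList<>();
        pending.add(new Call("listNodes", System.currentTimeMillis() - 1)); // already past deadline
        timeoutPendingCalls(System.currentTimeMillis(), pending);
    }
}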


In short: the client could not reach the Kafka server, the connection timed out, and the client threw the timeout exception.

Only then did I notice in the console output that the node IP the Kafka client kept trying to connect to was 127.0.0.1 — the loopback address. No Kafka was running locally, so this had to be a server-side configuration problem. Since deploying Kafka on the server had gone smoothly, and the console producer/consumer commands worked fine on the server itself, something missing from the configuration must have been causing the client to resolve the wrong node IP.

19519 [kafka-admin-client-thread | adminclient-1] WARN org.apache.kafka.clients.NetworkClient - [AdminClient clientId=adminclient-1] Connection to node 0 (localhost.localdomain/127.0.0.1:9092) could not be established. Broker may not be available.

 

4. Checking the Kafka Server Configuration

I then turned to the server configuration and examined the log printed when starting the Kafka server from the command line:

[2021-11-08 18:51:46,987] INFO KafkaConfig values: 
    advertised.host.name = null
    advertised.listeners = null
    advertised.port = null
    ...
    host.name = null
    ...
    listeners = null
    ...
    port = 9092
    ...
 (kafka.server.KafkaConfig)

Many of these parameters are null. In the server.properties startup config under the config directory, only listeners is even mentioned, and only in a comment. So the investigation turned to Kafka's configuration: what do these null-valued parameters actually do?

listeners: the comma-separated list of URIs (e.g. PLAINTEXT://host:9092) on which the broker's socket server listens. If not configured, the broker binds using the value returned by java.net.InetAddress.getCanonicalHostName().

advertised.host.name: deprecated; only used when advertised.listeners or listeners are not set. Use advertised.listeners instead.

advertised.port: deprecated; only used when advertised.listeners or listeners are not set. Use advertised.listeners instead.

advertised.listeners: the listeners the broker publishes (to ZooKeeper) for clients to use, if different from listeners. If not set, the value of listeners is used.

 

5. Conclusion

Putting it together: advertised.listeners is the address the broker publishes so that clients can connect to it correctly. When it is not set, it falls back to the value of listeners (and, with that also unset, to the canonical host name — here localhost.localdomain/127.0.0.1), so an external client ends up trying to reach the broker on an address that is only valid inside the IaaS network. From outside, that internal address can never be reached, so the connection never comes up and the client eventually throws the timeout exception.

19519 [kafka-admin-client-thread | adminclient-1] WARN org.apache.kafka.clients.NetworkClient - [AdminClient clientId=adminclient-1] Connection to node 0 (localhost.localdomain/127.0.0.1:9092) could not be established. Broker may not be available.

 

6. The Fix

Configure advertised.listeners correctly in the Kafka server's startup configuration file, server.properties:

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://<public-ip>:9092
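After restarting the broker, the failing call from section 1 can be rerun to confirm the fix; a sketch, with the bootstrap address as a placeholder:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class VerifyFix {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder: the same public IP configured in advertised.listeners.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<public-ip>:9092");
        try (AdminClient client = AdminClient.create(props)) {
            // The metadata response now advertises the public address, so the
            // "listNodes" Call succeeds instead of timing out.
            System.out.println(client.describeCluster().nodes().get());
        }
    }
}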

 

References:

Kafka official documentation

Kafka Listeners - Explained