1. 环境
服务器:CentOS Linux release 7.9.2009 (Core)
zookeeper:3.6.3
java:1.8.0_231
2. 问题
zookeeper客户端连接到zookeeper节点有20秒的延迟,而且每次都是20秒。自己编写客户端进行测试,通过日志看到在某个环节上会出现阻塞,导致延迟。
15:49:12.424 [main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=x.x.x.x:2181 sessionTimeout=5000 watcher=zk.ZKTest$1@393671df
15:49:12.427 [main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
15:49:12.431 [main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes
15:49:12.436 [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false
15:49:32.465 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server x.x.x.x/x.x.x.x:2181.
15:49:32.465 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
15:49:32.477 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.20.140.23:56244, server: x.x.x.x/x.x.x.x:2181
15:49:32.498 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server x.x.x.x/x.x.x.x:2181, session id = 0x201249468f70009, negotiated timeout = 5000
3. 解决过程
3.1 编写测试客户端,客户端只是连接到zookeeper
package zk;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.client.ConnectStringParser;
import org.apache.zookeeper.client.StaticHostProvider;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.net.InetSocketAddress;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.concurrent.CountDownLatch;
/**
* @author cc
* @function
* @date 2021/7/22 15:14
*/
public class ZKTest {
private static final Logger LOG = LoggerFactory.getLogger(ZKTest.class);
public static void main(String[] args) throws Exception {
ZKTest main = new ZKTest();
main.run(args[0]);
}
String date() {
return new SimpleDateFormat("HH:mm:ss.SSS").format(System.currentTimeMillis());
}
void run(String zkIps) throws Exception {
CountDownLatch latch = new CountDownLatch(1);
LOG.info("running...");
StaticHostProvider providers = new StaticHostProvider(new ConnectStringParser(zkIps).getServerAddresses());
InetSocketAddress address = providers.next(1000);
LOG.debug(" " + address.getAddress().getHostAddress());
address.getHostName();
LOG.info(" init StaticHostProvider done");
try (ZooKeeper zoo = new ZooKeeper(zkIps, 5000, new Watcher() {
@Override
public void process(WatchedEvent event) {
if (event.getState() == Event.KeeperState.SyncConnected) {
latch.countDown();;
LOG.info(" connected");
return;
}
System.out.println(event.toString());
}
})) {
latch.await();
for (int i=0; i<10; i++) {
List<String> results = zoo.getChildren("/ledgers/00/0000", null);
if (results != null) {
LOG.info(" result size " + results.size());
}
Thread.sleep(1000);
}
}
}
}
3.2 查看连接日志,确认在哪一个环节上出现阻塞
15:49:12.436 [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false
15:49:32.465 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server x.x.x.x/x.x.x.x:2181.
结合程序日志,可以看到上边两个日志之间阻塞了20秒。
3.3 查看源代码
调用过程:
// 1.程序初始化
ZooKeeper zoo = new ZooKeeper(zkIps, 5000, new Watcher());
// 2.初始化zookeeper对象
public ZooKeeper(
String connectString,
int sessionTimeout,
Watcher watcher,
boolean canBeReadOnly,
HostProvider aHostProvider,
ZKClientConfig clientConfig) throws IOException {
LOG.info(
"Initiating client connection, connectString={} sessionTimeout={} watcher={}",
connectString,
sessionTimeout,
watcher);
if (clientConfig == null) {
clientConfig = new ZKClientConfig();
}
this.clientConfig = clientConfig;
watchManager = defaultWatchManager();
watchManager.defaultWatcher = watcher;
ConnectStringParser connectStringParser = new ConnectStringParser(connectString);
hostProvider = aHostProvider;
cnxn = createConnection(
connectStringParser.getChrootPath(),
hostProvider,
sessionTimeout,
this,
watchManager,
getClientCnxnSocket(),
canBeReadOnly);
cnxn.start();
}
// 3.查看createConnection方法
// @VisibleForTesting
protected ClientCnxn createConnection(
String chrootPath,
HostProvider hostProvider,
int sessionTimeout,
ZooKeeper zooKeeper,
ClientWatchManager watcher,
ClientCnxnSocket clientCnxnSocket,
boolean canBeReadOnly) throws IOException {
return new ClientCnxn(
chrootPath,
hostProvider,
sessionTimeout,
this,
watchManager,
clientCnxnSocket,
canBeReadOnly);
}
// 4.查看ClientCnxn初始化过程
public ClientCnxn(
String chrootPath,
HostProvider hostProvider,
int sessionTimeout,
ZooKeeper zooKeeper,
ClientWatchManager watcher,
ClientCnxnSocket clientCnxnSocket,
long sessionId,
byte[] sessionPasswd,
boolean canBeReadOnly) throws IOException {
this.zooKeeper = zooKeeper;
this.watcher = watcher;
this.sessionId = sessionId;
this.sessionPasswd = sessionPasswd;
this.sessionTimeout = sessionTimeout;
this.hostProvider = hostProvider;
this.chrootPath = chrootPath;
connectTimeout = sessionTimeout / hostProvider.size();
readTimeout = sessionTimeout * 2 / 3;
readOnly = canBeReadOnly;
sendThread = new SendThread(clientCnxnSocket);
eventThread = new EventThread();
this.clientConfig = zooKeeper.getClientConfig();
initRequestTimeout();
}
// 5.查看initRequestTimeout方法,可以看到出现阻塞的前一条日志在这个方法中打印。然后返回到第二步中,查看ClientCnxn线程启动的部分(cnxn.start())。
private void initRequestTimeout() {
try {
requestTimeout = clientConfig.getLong(
ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT,
ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT_DEFAULT);
LOG.info(
"{} value is {}. feature enabled={}",
ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT,
requestTimeout,
requestTimeout > 0);
} catch (NumberFormatException e) {
LOG.error(
"Configured value {} for property {} can not be parsed to long.",
clientConfig.getProperty(ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT),
ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT);
throw e;
}
}
// 6.查看线程启动方法(run),这里我们只看前面一部分代码。当线程启动后,会调用startConnect方法创建到zookeeper的socket连接,下面查看这个方法的内容。
@Override
public void run() {
clientCnxnSocket.introduce(this, sessionId, outgoingQueue);
clientCnxnSocket.updateNow();
clientCnxnSocket.updateLastSendAndHeard();
int to;
long lastPingRwServer = Time.currentElapsedTime();
final int MAX_SEND_PING_INTERVAL = 10000; //10 seconds
InetSocketAddress serverAddress = null;
while (state.isAlive()) {
try {
if (!clientCnxnSocket.isConnected()) {
// don't re-establish connection if we are closing
if (closing) {
break;
}
if (rwServerAddress != null) {
serverAddress = rwServerAddress;
rwServerAddress = null;
} else {
serverAddress = hostProvider.next(1000);
}
onConnecting(serverAddress);
startConnect(serverAddress);
clientCnxnSocket.updateLastSendAndHeard();
}
// 7.通过下面两个方法,我们可以确定zookeeper尝试创建ZooKeeperSaslClient连接,但是失败了。而最上边的连接日志显示是创建连接失败以上的代码造成了阻塞,由此可以确认是创建ZooKeeperSaslClient的时候阻塞了,所以需要查看ZooKeeperSaslClient创建过程。
private void startConnect(InetSocketAddress addr) throws IOException {
// initializing it for new connection
saslLoginFailed = false;
if (!isFirstConnect) {
try {
Thread.sleep(ThreadLocalRandom.current().nextLong(1000));
} catch (InterruptedException e) {
LOG.warn("Unexpected exception", e);
}
}
changeZkState(States.CONNECTING);
String hostPort = addr.getHostString() + ":" + addr.getPort();
MDC.put("myid", hostPort);
setName(getName().replaceAll("\\(.*\\)", "(" + hostPort + ")"));
if (clientConfig.isSaslClientEnabled()) {
try {
if (zooKeeperSaslClient != null) {
zooKeeperSaslClient.shutdown();
}
zooKeeperSaslClient = new ZooKeeperSaslClient(SaslServerPrincipal.getServerPrincipal(addr, clientConfig), clientConfig);
} catch (LoginException e) {
// An authentication error occurred when the SASL client tried to initialize:
// for Kerberos this means that the client failed to authenticate with the KDC.
// This is different from an authentication error that occurs during communication
// with the Zookeeper server, which is handled below.
LOG.warn(
"SASL configuration failed. "
+ "Will continue connection to Zookeeper server without "
+ "SASL authentication, if Zookeeper server allows it.", e);
eventThread.queueEvent(new WatchedEvent(Watcher.Event.EventType.None, Watcher.Event.KeeperState.AuthFailed, null));
saslLoginFailed = true;
}
}
logStartConnect(addr);
clientCnxnSocket.connect(addr);
}
private void logStartConnect(InetSocketAddress addr) {
LOG.info("Opening socket connection to server {}.", addr);
if (zooKeeperSaslClient != null) {
LOG.info("SASL config status: {}", zooKeeperSaslClient.getConfigStatus());
}
}
// 8.通过查看ZooKeeperSaslClient创建过程代码,可以看到程序在调用SaslServerPrincipal.getServerPrincipal(addr, clientConfig)方法的时候使用了getHostName方法...
static String getServerPrincipal(WrapperInetSocketAddress addr, ZKClientConfig clientConfig) {
String configuredServerPrincipal = clientConfig.getProperty(ZKClientConfig.ZOOKEEPER_SERVER_PRINCIPAL);
if (configuredServerPrincipal != null) {
// If server principal is already configured then return it
return configuredServerPrincipal;
}
String principalUserName = clientConfig.getProperty(
ZKClientConfig.ZK_SASL_CLIENT_USERNAME,
ZKClientConfig.ZK_SASL_CLIENT_USERNAME_DEFAULT);
// 此方法会在某些情况下造成阻塞(会对ip进行dns反向解析,如果dns服务不好用或者连接不上,则出现阻塞)。
String hostName = addr.getHostName();
3.4 问题修复
通过上述过程,可以确定是zookeeper默认使用了ZooKeeperSaslClient,而这个过程中调用了getHostName方法,从而造成程序阻塞。
所以可以通过在hosts文件添加上zookeeper的ip地址,使得getHostName可以直接使用hosts里面的dns结果。
在/etc/hosts文件中添加记录,记录格式如下:
ip1 host1
ip2 hosts2
4. 总结
问题排查过程比较繁琐,核心思想就是通过修改服务器、程序参数,确定延迟产生的环节,然后结合代码查看具体的处理过程。
以后一定谨慎使用InetSocketAddress的getHostName方法,这个方法会对传进去的字符串进行域名解析,即使是一个ip地址也会进行解析,而dns解析的快慢和服务器配置有关系。