💡 文章概要:上一篇文章介绍了几种内存分析工具,在工作过程中,使用了这些工具解决很多线上的问题。本篇文章中使用了MAT工具,帮助分析出了resourcemanager泄漏的原因。
1. 背景
2021年线上Hadoop升级到了Hadoop3.3.0版本。运行了6个月,发现resourcemanager监控面板中,堆内存一直缓慢增长,每隔一段时间服务都要重启一次,十分影响服务稳定性,因此需要排查出来内存泄漏原因。
2. 分析
2.1 分析思路
面对内存泄漏的问题,首先要知道哪些对象泄漏了,其次分析为什么代码没有将这些对象清理。查看泄漏的对象,一般需要使用VisualVM或者MAT打开hprof,看看对象的分布。找到泄漏的对象后,再就需要定位到具体的代码,修改代码并重新测试/上线。
2.2 泄漏对象排查
由于resourcemanager堆内存大小是16GB,下载到本地也没有足够内存打开。因此选择在resourcemanager将hprof通过mat分析成为HTML,在本地直接查看分析后的报告。
打开Leak_Suspects页面,发现分析工具已经提供了分析结果。发现DelegationTokenRenewer实例占据75%的堆内存,如下所示:
同时发现DelegationTokenRenewer实例占据10GB内存,如下所示:
DelegationTokenRenewer所在的对象层级如下:
这10GB内存是HashMap的内部类Node实例占据:HashMap是DelegationTokenRenewer类的成员变量futures的类型:
查看Node占用的Shallow Heap,超过10GB的内存没有释放:
2.3 代码定位
DelegationTokenRenewer类有Map<DelegationTokenRenewerEvent, Future<?>>类型的成员变量futures:
private final Map<DelegationTokenRenewerEvent, Future<?>> futures =
new HashMap<>();
HashMap有Node<K,V>[]类型的成员变量table:
transient Node<K,V>[] table;
Node内部类保存DelegationTokenRenewerEvent实例为Key,Future<?>实例为Value:
final int hash;
final K key;
V value;
Node<K,V> next;
由于发生了内存泄漏,需要检查是否是DelegationTokenRenewerEvent实例没有释放,或者Future<?>实例没有释放,或者Future实例多插入了对象。
检查Future对象是否插入对象异常
processDelegationTokenRenewerEvent方法负责向futures中添加event对象,这一步应该没有问题,出问题原因应该是没有及时清除:
private void processDelegationTokenRenewerEvent(
DelegationTokenRenewerEvent evt) {
serviceStateLock.readLock().lock();
try {
if (isServiceStarted) {
Future<?> future =
renewerService.submit(new DelegationTokenRenewerRunnable(evt));
futures.put(evt, future);
} else {
pendingEventQueue.add(evt);
}
} finally {
serviceStateLock.readLock().unlock();
}
}
检查Future对象是否清理异常 直接看处理逻辑,发现只有在触发TimeoutException时,才会删除futures对象。这就是内存泄漏的直接原因:
public void run() {
while (true) {
for (Map.Entry<DelegationTokenRenewerEvent, Future<?>> entry : futures
.entrySet()) {
DelegationTokenRenewerEvent evt = entry.getKey();
Future<?> future = entry.getValue();
try {
future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// Cancel thread and retry the same event in case of timeout
if (future != null && !future.isDone() && !future.isCancelled()) {
future.cancel(true);
#删除event
futures.remove(evt);
if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
renewalTimer.schedule(
getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
tokenRenewerThreadRetryInterval);
} else {
LOG.info(
"Exhausted max retry attempts {} in token renewer "
+ "thread for {}",
tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
}
}
} catch (Exception e) {
LOG.info("Problem in submitting renew tasks in token renewer "
+ "thread.", e);
}
}
}
}
}
因此,在try-catch结束后,删除执行成功的future:
将代码上线到测试环境,观察发现没有没有效果,resourcemanager依然内存溢出。
为了找到报错信息,我这里增加一些日志进行打印,看看是哪里出问题了。其实还可以通过远程调试的方式排查原因。注意,测试/调试都需要在测试上的resourcemanager执行,不能在正式机器上执行:
private final class DelegationTokenRenewerPoolTracker
implements Runnable {
DelegationTokenRenewerPoolTracker() {
}
/**
* Keep traversing <Future> of renewer pool threads and wait for specific
* timeout. In case of timeout exception, retry the event till no. of
* attempts reaches max attempts with specific interval.
*/
@Override
public void run() {
while (true) {
//此处增加日志,查看线程是否一直在跑
LOG.info("===allways in the loop===");
for (Map.Entry<DelegationTokenRenewerEvent, Future<?>> entry : futures
.entrySet()) {
DelegationTokenRenewerEvent evt = entry.getKey();
Future<?> future = entry.getValue();
try {
//此处增加日志,查看是否有token更新事件需要检测
LOG.info("===before future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);===");
LOG.info("===" + tokenRenewerThreadTimeout + "ms===");
future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
//最终应该执行的位置
LOG.info("=== going to TimeoutException to deal with DelegationTokenRenewerEvent===");
// Cancel thread and retry the same event in case of timeout
if (future != null && !future.isDone() && !future.isCancelled()) {
LOG.info("===see DelegationTokenRenewerPoolTracker is about to delete event===");
future.cancel(true);
futures.remove(evt);
if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
renewalTimer.schedule(
getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
tokenRenewerThreadRetryInterval);
} else {
LOG.info(
"Exhausted max retry attempts {} in token renewer "
+ "thread for {}",
tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
}
}
} catch (Exception e) {
LOG.info("Problem in submitting renew tasks in token renewer "
+ "thread.", e);
}
}
}
}
}
启动hadoop 3.3.0 resourcemanager后,DelegationTokenRenewerPoolTracker线程应该一直循环执行,查看该线程中所有打印的日志信息:
hadoop@gdc-nn01-overseatestnew:~/logs$ grep "allways in" hadoop-hadoop-resourcemanager-gdc-nn01-overseatestnew.i.nease.net.log
2021-10-13 21:05:33,801 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: ===allways in the loop===
...
//死循环般地打印日志
...
2021-10-13 21:05:33,804 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: ===allways in the loop===
上面就是所有日志了,可以看到,线程只跑了1s,就退出了。因此,内存溢出的最大原因是清理futures的线程异常退出了,解决线程异常退出的问题,应该就能解决内存溢出问题。
由于线程异常退出无法通过进程将异常日志打印出来,只能在最外层加上try-catch,在catch中接受异常信息并打印:
public void run() {
try{
while (true) {
LOG.info("===allways in the loop===");
for (Map.Entry<DelegationTokenRenewerEvent, Future<?>> entry : futures
.entrySet()) {
DelegationTokenRenewerEvent evt = entry.getKey();
Future<?> future = entry.getValue();
try {
LOG.info("===before future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);===");
LOG.info("===" + tokenRenewerThreadTimeout + "ms===");
future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
LOG.info("=== going to TimeoutException to deal with DelegationTokenRenewerEvent===");
// Cancel thread and retry the same event in case of timeout
if (future != null && !future.isDone() && !future.isCancelled()) {
LOG.info("===see DelegationTokenRenewerPoolTracker is about to delete event===");
future.cancel(true);
futures.remove(evt);
if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
renewalTimer.schedule(
getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
tokenRenewerThreadRetryInterval);
} else {
LOG.info(
"Exhausted max retry attempts {} in token renewer "
+ "thread for {}",
tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
}
}
} catch (Exception e) {
LOG.info("Problem in submitting renew tasks in token renewer "
+ "thread.", e);
}
/*if(future.isDone()){
LOG.info("===task is done, remove the map element===");
futures.remove(evt);
}*/
}
/*try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}*/
}
}catch (Exception e){
LOG.info("=== test DelegationTokenRenewerPoolTracker dive into Exception ===");
LOG.info("===========");
e.printStackTrace();
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
try {
e.printStackTrace(pw);
LOG.info(sw.toString());
}finally {
pw.close();
}
LOG.info("===========");
LOG.info(e.toString());
LOG.info("===========");
}
LOG.info("===goodbye===");
}
部署到测试环境后,发现报错ConcurrentModificationException:
2021-10-14 11:47:49,362 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: === test DelegationTokenRenewerPoolTracker dive into Exception ===
2021-10-14 11:47:49,363 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1442)
at java.util.HashMap$EntryIterator.next(HashMap.java:1476)
at java.util.HashMap$EntryIterator.next(HashMap.java:1474)
at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerPoolTracker.run(DelegationTokenRenewer.java:997)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-10-14 11:47:49,363 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: ===goodbye===
找到抛ConcurrentModificationException异常的代码。HashMap是线程不安全,在遍历时,不允许插入对象。当修改的Node数量和初次遍历时的数量不一致时,直接抛异常,导致线程退出:
final Node<K,V> nextNode() {
Node<K,V>[] t;
Node<K,V> e = next;
//
if (modCount != expectedModCount)
throw new ConcurrentModificationException();
if (e == null)
throw new NoSuchElementException();
if ((next = (current = e).next) == null && (t = table) != null) {
do {} while (index < t.length && (next = t[index++]) == null);
}
return e;
}
解决方案:将futures类型从HashMap改成ConcurrentHashMap即可。
3. 效果检查
查看DelegationTokenRenewerPoolTracker线程是否启动
发现DelegationTokenRenewerPoolTracker线程已经启动:
resourcemanager内存历史使用情况
发现resourcemanager没有内存溢出迹象:
具体效果还需要观察Heap Memory1~2个月的时间。
4. 深入探究异常原因
DelegationTokenRenew包含四类线程:
- DelegationTokenCancelThread:用于删除token。
- DelayedTokenRemovalRunnable:收集将要删除的token信息。
- DelegationTokenRenewerPoolTracker:跟进event的处理进度,进行失败重试。
- DelegationTokenRenewerRunnable:执行event。
这四类线程共同维护Token的更新与删除。他们的关系如下所示:
如下,可以看到DelegationTokenRenewerPoolTracker在查看DelegationTokenRenewerEvent时,没有删除DelegationTokenRenewerEvent,导致了溢出: