💡 文章概要:上一篇文章介绍了几种内存分析工具,在工作过程中,使用了这些工具解决很多线上的问题。本篇文章中使用了MAT工具,帮助分析出了resourcemanager泄漏的原因。

1. 背景

2021年线上Hadoop升级到了Hadoop3.3.0版本。运行了6个月,发现resourcemanager监控面板中,堆内存一直缓慢增长,每隔一段时间服务都要重启一次,十分影响服务稳定性,因此需要排查出来内存泄漏原因。

2. 分析

2.1 分析思路

面对内存泄漏的问题,首先要知道哪些对象泄漏了,其次分析为什么代码没有将这些对象清理。查看泄漏的对象,一般需要使用VisualVM或者MAT打开hprof,看看对象的分布。找到泄漏的对象后,再就需要定位到具体的代码,修改代码并重新测试/上线。

2.2 泄漏对象排查

由于resourcemanager堆内存大小是16GB,下载到本地也没有足够内存打开。因此选择在resourcemanager将hprof通过mat分析成为HTML,在本地直接查看分析后的报告。

打开Leak_Suspects页面,发现分析工具已经提供了分析结果。发现DelegationTokenRenewer实例占据75%的堆内存,如下所示:

Untitled.png

同时发现DelegationTokenRenewer实例占据10GB内存,如下所示:

Untitled 1.png

DelegationTokenRenewer所在的对象层级如下:

Untitled 2.png

这10GB内存是HashMap的内部类Node实例占据:HashMap是DelegationTokenRenewer类的成员变量futures的类型:

Untitled 3.png

查看Node占用的Shallow Heap,超过10GB的内存没有释放:

Untitled 4.png

2.3 代码定位

DelegationTokenRenewer类有Map<DelegationTokenRenewerEvent, Future<?>>类型的成员变量futures:

private final Map<DelegationTokenRenewerEvent, Future<?>> futures =
      new HashMap<>();

HashMap有Node<K,V>[]类型的成员变量table:

    transient Node<K,V>[] table;

Node内部类保存DelegationTokenRenewerEvent实例为Key,Future<?>实例为Value:

final int hash;
final K key;
V value;
Node<K,V> next;

由于发生了内存泄漏,需要检查是否是DelegationTokenRenewerEvent实例没有释放,或者Future<?>实例没有释放,或者Future实例多插入了对象。

检查Future对象是否插入对象异常

processDelegationTokenRenewerEvent方法负责向futures中添加event对象,这一步应该没有问题,出问题原因应该是没有及时清除:

 private void processDelegationTokenRenewerEvent(
      DelegationTokenRenewerEvent evt) {
    serviceStateLock.readLock().lock();
    try {
      if (isServiceStarted) {
        Future<?> future =
            renewerService.submit(new DelegationTokenRenewerRunnable(evt));
        futures.put(evt, future);
      } else {
        pendingEventQueue.add(evt);
      }
    } finally {
      serviceStateLock.readLock().unlock();
    }
  }

检查Future对象是否清理异常 直接看处理逻辑,发现只有在触发TimeoutException时,才会删除futures对象。这就是内存泄漏的直接原因:

public void run() {
      while (true) {
        for (Map.Entry<DelegationTokenRenewerEvent, Future<?>> entry : futures
            .entrySet()) {
          DelegationTokenRenewerEvent evt = entry.getKey();
          Future<?> future = entry.getValue();
          try {
            future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
          } catch (TimeoutException e) {

            // Cancel thread and retry the same event in case of timeout
            if (future != null && !future.isDone() && !future.isCancelled()) {
              future.cancel(true);
              #删除event
              futures.remove(evt);
              if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
                renewalTimer.schedule(
                    getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
                    tokenRenewerThreadRetryInterval);
              } else {
                LOG.info(
                    "Exhausted max retry attempts {} in token renewer "
                        + "thread for {}",
                    tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
              }
            }
          } catch (Exception e) {
            LOG.info("Problem in submitting renew tasks in token renewer "
                + "thread.", e);
          }
        }
      }
    }
  }

因此,在try-catch结束后,删除执行成功的future:

Untitled 5.png

将代码上线到测试环境,观察发现没有没有效果,resourcemanager依然内存溢出。

为了找到报错信息,我这里增加一些日志进行打印,看看是哪里出问题了。其实还可以通过远程调试的方式排查原因。注意,测试/调试都需要在测试上的resourcemanager执行,不能在正式机器上执行:

private final class DelegationTokenRenewerPoolTracker
      implements Runnable {

    DelegationTokenRenewerPoolTracker() {
    }

    /**
     * Keep traversing <Future> of renewer pool threads and wait for specific
     * timeout. In case of timeout exception, retry the event till no. of
     * attempts reaches max attempts with specific interval.
     */
    @Override
    public void run() {
      while (true) {
        //此处增加日志,查看线程是否一直在跑
        LOG.info("===allways in the loop===");
        for (Map.Entry<DelegationTokenRenewerEvent, Future<?>> entry : futures
            .entrySet()) {
          DelegationTokenRenewerEvent evt = entry.getKey();
          Future<?> future = entry.getValue();
          try {
            //此处增加日志,查看是否有token更新事件需要检测
            LOG.info("===before future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);===");
            LOG.info("===" + tokenRenewerThreadTimeout + "ms===");
            future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
          } catch (TimeoutException e) {
            //最终应该执行的位置
            LOG.info("=== going to TimeoutException to deal with DelegationTokenRenewerEvent===");
            // Cancel thread and retry the same event in case of timeout
            if (future != null && !future.isDone() && !future.isCancelled()) {
              LOG.info("===see DelegationTokenRenewerPoolTracker is about to delete event===");
              future.cancel(true);
              futures.remove(evt);
              if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
                renewalTimer.schedule(
                    getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
                    tokenRenewerThreadRetryInterval);
              } else {
                LOG.info(
                    "Exhausted max retry attempts {} in token renewer "
                        + "thread for {}",
                    tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
              }
            }
          } catch (Exception e) {
            LOG.info("Problem in submitting renew tasks in token renewer "
                + "thread.", e);
          }
        }
      }
    }
  }

启动hadoop 3.3.0 resourcemanager后,DelegationTokenRenewerPoolTracker线程应该一直循环执行,查看该线程中所有打印的日志信息:

hadoop@gdc-nn01-overseatestnew:~/logs$ grep "allways in" hadoop-hadoop-resourcemanager-gdc-nn01-overseatestnew.i.nease.net.log
2021-10-13 21:05:33,801 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: ===allways in the loop===
...
//死循环般地打印日志
...
2021-10-13 21:05:33,804 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: ===allways in the loop===

上面就是所有日志了,可以看到,线程只跑了1s,就退出了。因此,内存溢出的最大原因是清理futures的线程异常退出了,解决线程异常退出的问题,应该就能解决内存溢出问题。

由于线程异常退出无法通过进程将异常日志打印出来,只能在最外层加上try-catch,在catch中接受异常信息并打印:

public void run() {
      try{
      while (true) {
        LOG.info("===allways in the loop===");
        for (Map.Entry<DelegationTokenRenewerEvent, Future<?>> entry : futures
            .entrySet()) {
          DelegationTokenRenewerEvent evt = entry.getKey();
          Future<?> future = entry.getValue();
          try {
            LOG.info("===before future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);===");
            LOG.info("===" + tokenRenewerThreadTimeout + "ms===");
            future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
          } catch (TimeoutException e) {
            LOG.info("=== going to TimeoutException to deal with DelegationTokenRenewerEvent===");
            // Cancel thread and retry the same event in case of timeout
            if (future != null && !future.isDone() && !future.isCancelled()) {
              LOG.info("===see DelegationTokenRenewerPoolTracker is about to delete event===");
              future.cancel(true);
              futures.remove(evt);
              if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
                renewalTimer.schedule(
                    getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
                    tokenRenewerThreadRetryInterval);
              } else {
                LOG.info(
                    "Exhausted max retry attempts {} in token renewer "
                        + "thread for {}",
                    tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
              }
            }
          } catch (Exception e) {
            LOG.info("Problem in submitting renew tasks in token renewer "
                + "thread.", e);
          }

          /*if(future.isDone()){
            LOG.info("===task is done, remove the map element===");
            futures.remove(evt);
          }*/
        }
        /*try {
          Thread.sleep(1000);
        } catch (InterruptedException e) {
          e.printStackTrace();
        }*/
      }
    }catch (Exception e){
        LOG.info("=== test DelegationTokenRenewerPoolTracker dive into Exception ===");
        LOG.info("===========");
        e.printStackTrace();
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        try  {
          e.printStackTrace(pw);
          LOG.info(sw.toString());
        }finally {
          pw.close();
        }
        LOG.info("===========");
        LOG.info(e.toString());
        LOG.info("===========");
      }
      LOG.info("===goodbye===");
    }

部署到测试环境后,发现报错ConcurrentModificationException:

2021-10-14 11:47:49,362 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: === test DelegationTokenRenewerPoolTracker dive into Exception ===
2021-10-14 11:47:49,363 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1442)
        at java.util.HashMap$EntryIterator.next(HashMap.java:1476)
        at java.util.HashMap$EntryIterator.next(HashMap.java:1474)
        at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerPoolTracker.run(DelegationTokenRenewer.java:997)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-10-14 11:47:49,363 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: ===goodbye===

找到抛ConcurrentModificationException异常的代码。HashMap是线程不安全,在遍历时,不允许插入对象。当修改的Node数量和初次遍历时的数量不一致时,直接抛异常,导致线程退出:

final Node<K,V> nextNode() {
            Node<K,V>[] t;
            Node<K,V> e = next;
            //
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            if (e == null)
                throw new NoSuchElementException();
            if ((next = (current = e).next) == null && (t = table) != null) {
                do {} while (index < t.length && (next = t[index++]) == null);
            }
            return e;
        }

解决方案:将futures类型从HashMap改成ConcurrentHashMap即可。 Untitled 6.png

3. 效果检查

查看DelegationTokenRenewerPoolTracker线程是否启动

发现DelegationTokenRenewerPoolTracker线程已经启动:

Untitled 7.png

resourcemanager内存历史使用情况

发现resourcemanager没有内存溢出迹象:

Untitled 8.png

具体效果还需要观察Heap Memory1~2个月的时间。

4. 深入探究异常原因

DelegationTokenRenew包含四类线程:

  • DelegationTokenCancelThread:用于删除token。
  • DelayedTokenRemovalRunnable:收集将要删除的token信息。
  • DelegationTokenRenewerPoolTracker:跟进event的处理进度,进行失败重试。
  • DelegationTokenRenewerRunnable:执行event。

这四类线程共同维护Token的更新与删除。他们的关系如下所示:

Untitled 9.png

如下,可以看到DelegationTokenRenewerPoolTracker在查看DelegationTokenRenewerEvent时,没有删除DelegationTokenRenewerEvent,导致了溢出:

Untitled 10.png