Java的thread dump是当前JVM中所有线程状态的一个快照, 包含每个线程当前状态和调用堆栈(stack trace), thread dump对于诊断Java线程相关的问题非常有用。收集thread dump的方法有很多, 可以通过jvm自带的命令行工具jps + jstack, 其它第三方JVM诊断工具通常包含生成thread dump的功能,这里不赘述, 可以自行google/bing/baidu/… 下面主要基于我自己的一个案例来聊聊我对thread dump的理解和分析方法。

起因是我自己写的一个程序, 启动的线程没有响应(一个SwingWorker线程和一个Quartz调度), 卡在什么地方没有响应。

java threaddump java thread dump包_开发语言


遇到这类线程相关的问题,我们最想知道的是这些线程到底在干啥?被卡在什么地方了,这时候就需要通过生成thread dump,来对线程进行分析。

分析过程

生成thread dump,把对应的线程找出来(后面我再聊聊我找对应线程的一些心得):

"Quartz-Job-Scheduler_Worker-1" #47 ...
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
	- parking to wait for  <0x000000070d4b4768> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.7/AbstractQueuedSynchronizer.java:2081)
	at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:33)
	- locked <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
	at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
	at org.littlestar.monitor.task.linux.LogFileMonitor.execute(LogFileMonitor.java:110)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
	- locked <0x000000070d4b4738> (a java.lang.Object)

"Quartz-Job-Scheduler_Worker-2" #48 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32)
	- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
	at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
	at org.littlestar.monitor.task.linux.LogFileMonitor.execute(LogFileMonitor.java:110)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
	- locked <0x000000070d4bbf28> (a java.lang.Object)

"SwingWorker-pool-2-thread-1" #80 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32)
        - waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
        at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
        at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
        at org.littlestar.monitor.task.linux.LinuxFileSystemViewer.executeInBackground(LinuxFileSystemViewer.java:91)
        at org.littlestar.monitor.task.ToolkitExt$BackgroundWorker.doInBackground(ToolkitExt.java:119)
        at javax.swing.SwingWorker$1.call(java.desktop@11.0.7/SwingWorker.java:304)
        at java.util.concurrent.FutureTask.run(java.base@11.0.7/FutureTask.java:264)
        at javax.swing.SwingWorker.run(java.desktop@11.0.7/SwingWorker.java:343)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)

   Locked ownable synchronizers:
        - <0x000000070d4cab60> (a java.util.concurrent.ThreadPoolExecutor$Worker)

从thread dump可以看出"SwingWorker-pool-2-thread-1"和Quartz-Job-Scheduler_Worker-2" 线程都处于阻塞状态( java.lang.Thread.State: BLOCKED),被阻塞的原因是无法获得需要的锁(waiting to lock …)。这两个线程在执行JschConnectorEntity.getConnection方法时因为无法获得锁(0x000000070d4a5a80)的锁而进入阻塞状态, 让出CPU执行:

"SwingWorker-pool-2-thread-1" #80 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32) -->
	- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)    --> 等待锁0x000000070d4a5a80
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
	...
"Quartz-Job-Scheduler_Worker-2" #48 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32) -->
	- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)    --> 等待锁0x000000070d4a5a80
	...

waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity) 说明等待的锁的地址是0x000000070d4a5a80,锁是一个org.littlestar.monitor.entity.JschConnectorEntity对象, 这是一个对象锁(synchronized),这个锁被谁持有呢?我们通过锁地址0x000000070d4a5a80进行查找可以看到持有该锁定的是"Quartz-Job-Scheduler_Worker-1"线程:

"Quartz-Job-Scheduler_Worker-1" #47...
    ...
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:33) --> 
	- locked <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity) --> 持有0x000000070d4a5a80
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
	at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
   ...

“Quartz-Job-Scheduler_Worker-1” 正在执行JschConnectorEntity.getConnection()方法, 并持有0x000000070d4a5a80锁, 看代码(JschConnectorEntity.java:33)可以看到JschConnectorEntity.getConnection()是被synchronized关键字修饰的方法, 想要执行getConnection()方法, 首先需要获取getConnection()方法的对象的对象锁:

public class JschConnectorEntity extends ConnectorEntity<Session> {
    ...
	@Override
	public synchronized Session getConnection() throws Exception { 
	...
	}
	...
}

“Quartz-Job-Scheduler_Worker-1” 进入了同步代码块(synchornized block), 持有JschConnectorEntity对象的对象锁, 因为对象锁是互斥的,所以 “Quartz-Job-Scheduler_Worker-1” 和"Quartz-Job-Scheduler_Worker-1"无法获得锁,执行getConnection()被阻塞这是很正常的,Java线程同步机制就是这么设计的。问题在于"Quartz-Job-Scheduler_Worker-1"为什么一直持有0x000000070d4a5a80锁, 不释放呢?继续分析Quartz-Job-Scheduler_Worker-1线程:

"Quartz-Job-Scheduler_Worker-1" #47 ...
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
	- parking to wait for  <0x000000070d4b4768> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.7/AbstractQueuedSynchronizer.java:2081)
	at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
	at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:33)
	- locked <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)

可以看出Quartz-Job-Scheduler_Worker-1进入getConnection后,调用borrowObject方法,而borrowObject又调用了takeFirst, 该线程是在takeFirst方法调用了ConditionObject.await()后开始进入等待(WAITING)状态,Condition使用的是LockSupport实现的线程等待和唤醒,而非传统的Object.wait和Object.notify。通常,我们只需要看到LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)源代码就可以知道Quartz-Job-Scheduler_Worker-1线程到底在等什么,而再往上的代码其实是await()的底层实现:

class LinkedBlockingDeque<E> extends AbstractQueue<E>
        implements Deque<E>, Serializable {
   ...
    /** Condition for waiting takes */
    private final Condition notEmpty;
    ...
 	lock = new InterruptibleReentrantLock(fairness);
    notEmpty = lock.newCondition();
    ...
    /**
     * Unlinks the first element in the queue, waiting until there is an element
     * to unlink if the queue is empty.
     *
     * @return the unlinked element
     * @throws InterruptedException if the current thread is interrupted
     */
    public E takeFirst() throws InterruptedException {
        lock.lock();
        try {
            E x;
            while ( (x = unlinkFirst()) == null) {
                // --> 等待其它线程执行notEmpty.signal()结束该线程的等待状态;
                notEmpty.await();  //--> LinkedBlockingDeque.java:1323
            }
            return x;
        } finally {
            lock.unlock();
        }
    }
}

结合所有的代码一起,诊断的逻辑大概是这样的:
1). 线程进入了同步方法JschConnectorEntity.getConnection(JschConnectorEntity.java:33), 持有了对象锁
2). getConnection使用apache commons pool2的(GenericObjectPool)实现的一个连接池JSchSessionPool;
3). 这里你不需要关心JSchSessionPool的细节, 这是我自己的代码, 你可以把他看成一个GenericObjectPool(的实现类);
4). getConnection调用GenericObjectPool.borrowObject()从线程池中获取一个连接;

public class JschConnectorEntity extends ConnectorEntity<Session> {
	@Override
	public synchronized Session getConnection() throws Exception {
		final JSchSessionPool pool = getSessionPool();
		Session session = pool.borrowObject();
		return session;
	}
}

5). 继续看GenericObjectPool.borrowObject()方法;
6). 通过分析borrowObject()代码可知,当线程池没有空闲对象时,borrowObject()方法会进入等待, borrowObject() -> takeFirst() -> notEmpty.await();

public class GenericObjectPool<T> extends BaseGenericObjectPool<T>
        implements ObjectPool<T>, GenericObjectPoolMXBean, UsageTracking<T> {
    ...
	private final LinkedBlockingDeque<PooledObject<T>> idleObjects;
	idleObjects = new LinkedBlockingDeque<>(config.getFairness());
	
    @Override
    public T borrowObject() throws Exception {
        return borrowObject(getMaxWaitDuration());
    }
    // 因为是第3方库, 不是自己的代码, 要了解意图, 最快捷的方法是阅读代码注释;
    /**
     * ...
     * <p>
     * If the pool is exhausted (no available idle instances and no capacity to
     * create new ones), this method will either block (if
     * {@link #getBlockWhenExhausted()} is true) or throw a
     * {@code NoSuchElementException} (if
     * {@link #getBlockWhenExhausted()} is false). The length of time that this
     * method will block when {@link #getBlockWhenExhausted()} is true is
     * determined by the value passed in to the {@code borrowMaxWaitMillis}
     * parameter.
     * </p>
     * <p>
     * When the pool is exhausted, multiple calling threads may be
     * simultaneously blocked waiting for instances to become available. A
     * "fairness" algorithm has been implemented to ensure that threads receive
     * available instances in request arrival order.
     * </p>
    */
    public T borrowObject(final Duration borrowMaxWaitDuration) throws Exception {
    	...
    	 p = idleObjects.takeFirst();
    	...
    }
}

7). 知道了等待原因是连接池没有空闲连接可用,造成的原因要么就是线程池配置太小了; 要么就是从线程池借了(borrowObject)线程,但没有将线程(returnObject)归还给线程池;

8). 通常连接池设置再小,只要使用完后返回连接池, 不会等待如此长时间, 重点放在排查连接归还问题,搜索执行了getConnection的代码,检查并补上缺失的归还连接池代码后问题解决。

一些个人心得

  1. 一个线程处于等待(WAITING) 或者被阻塞(BLOCKED )都是线程执行过程中正常的状态, 除了检测到死锁, 对于一个thread dump, 不能简单的通过线程状态来判断一个线程是否有问题, thread dump是一个瞬时状态(快照), 一个线程你执行thread dump时是阻塞的, 下一秒可能就因为满足条件被唤醒继续执行了, 所以单靠线程状态说明任何问题。一个线程线程长期处于BLOCKED状态是值得怀疑的,因为thread dump是瞬时值,你通常需要收集多个thread dump来进行比对验证, 一些第三方工具更直观, 特别是你可以进行线上诊断时, 例如开源的VisualVM:

    上面标蓝色的线程,通常表示处于BLOCKED状态的线程,黄色和橙色则通常表示处于WAITING状态的线程。一个线程长时间处于BLOCKED状态,通常是因为长时间无法获得锁,如上图Quartz-Job-Scheduler_Worker-2和SwingWorker-pool-2-thread-1线程,都是一条纯蓝色线条,说明从VisualVM监控开始,一直都是BLOCKED状态,从后面的Runing的时间0ms也可以看出。另外,VisualVM的插件"Threads Inspector"也非常推荐,你只要勾选你关心的线程,就能生成对应线程的call stack,非常好用。

对于thread dump分析,因为跟源代码密切相关,通常要有源代码,一些开源的库都会提供源代码包(*-source.jar)。thread dump需要对要分析的软件或者使用的库有比较深入的了解, 有时候你通过故障现象,就能猜到到大概是什么线程或者方法有问题,分析thread dump本身没太多通用的技巧,跟多是对代码的逻辑进行分析,通常软件的开发者在分析上更有先天的优势。

一个线程的stack trace都会有对应源代码的位置, 例如:

at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223) --> GenericObjectPool.java的第223行。

我们通常使用一些工具来阅读源代码以提高效率, 如Eclipse,把需要jar文件和源代码文件jar文件都放到一个项目内,通过Eclipse的功能,就非常方便的对源代码进行阅读, 以Eclipse为例,在java文件写入: org.apache.commons.pool2.impl.GenericObjectPool, 选择该类/方法/变量/… 右键选"Open Declaration", 就能够导航到对应的类和方法的源代码定义。

java threaddump java thread dump包_java_02

  1. 堆栈信息(stack trace)调用的顺序是从下到上的, 也就是线程从堆栈最底的方法开始的调用, 一直到stack trace最顶部的方法;
  2. stack trace最底部不代表线程开始执行的位置, thread dump是看不出线程是怎么启动的(启动线程的代码位置), 因为执行线程启动方法是不会阻塞的, 代码会继续往下执行,创建线程的方法通常执行结束就出栈了。观察所有线程的stack trace, 最底部执行的都是run()方法,这是因为java创建的线程,无论是通过Thread, 还是其它什么线程池的方式, 线程执行开始的地方通常都是Runnable接口的run()方法。
Runnable.run() -  When an object implementing interface Runnable is used to create a thread, starting the thread causes 
the object's run method to be called in that separately executing thread.

例如: 上面的SwingWorker-pool-2-thread-1, 线程是在一个Swing菜单事件响应方法中启动的:

SwingWorker<Object, Void> backgroundWorker;
	...
	backgroundWorker.execute();

从stack trace最底部开始, 到调用你写的代码之前, 实际上都是SwingWorker内部的调用, 通常我们不需要关心这部分内容, 我们分析一个线程调用堆栈,只需要从底部的第一条自己编写的代码(非JVM的代码)开始。当然你有兴趣从SwingWorker.execute()方法一直跟踪下去,也能走出下列调用:

...
	at javax.swing.SwingWorker$1.call(java.desktop@11.0.7/SwingWorker.java:304)
	at java.util.concurrent.FutureTask.run(java.base@11.0.7/FutureTask.java:264)
	at javax.swing.SwingWorker.run(java.desktop@11.0.7/SwingWorker.java:343)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
  1. 线程问题的诊断,特别是多线程,通常逃不掉锁的问题,thread dump包含了线程持有锁和等待锁的信息,根据锁类型不同,在thread dump对锁的信息的显示也稍有不同, 主要有两类:

(1). 一个是传统的对象锁(synchronized): locked … /waiting to lock …

进入了同步代码块, 就说明持有了该对象锁, 如:前面提到的synchronized方法, 或者一些同步块,如:
Object lock = new Object();
synchronized(lock) {		
  //...	
}
这类锁定的持有线程,都会在stack trace对应的位置标识锁信息: "locked <锁定地址> (锁对象的类) " 

当线程尝试进入同步代码块,但是无法获取锁,线程stack trace对应代码位置将标识: waiting to lock <锁定地址> (锁对象的类)

(2). 而另一个是线程持有同步器Synchronizer锁(AbstractOwnableSynchronizer的子类, 通常是AQS):
这类锁持有信息会在线程的调用堆栈的下面显示,如, 上例"SwingWorker-pool-2-thread-1"线程

Locked ownable synchronizers:
       - <0x000000070f5049a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

锁类型是ThreadPoolExecutor的内部类Worker, Worker类继承AQS(AbstractQueuedSynchronizer),而AQS则继承自AbstractOwnableSynchronizer:

private final class Worker
        extends AbstractQueuedSynchronizer ----> extends AbstractOwnableSynchronizer
        implements Runnable {
		...
}

加锁定是在ThreadPoolExecutor.runWorker方法内, 没有对象锁定这么直观一眼就能找到位置:

final void runWorker(Worker w) {
        ...
        w.lock();
        ...       
    }

对于这类锁定, 我们更常用的是可重入锁(ReentrantLock), ReentrantLock内部使用使用了AQS实现lock/unlock。举个简单例子:

class Test3 {
	public static void main(String[] args) throws Exception {
		ReentrantLock lock = new ReentrantLock();
		一个线程持有锁定后进入等待
		Thread thread1 = new Thread(() -> {
			lock.lock();
			try {
				TimeUnit.MINUTES.sleep(4);
			} catch (InterruptedException e) {
			} finally {
				lock.unlock();
			}
		});
		thread1.setName("Lock-holder");
		thread1.start();
		
		 一个线程同时尝试获取锁定
		Thread thread2 = new Thread(() -> {
			lock.lock();
			System.out.println("Got lock~");
		});
		thread2.setName("Lock-waiter");
		thread2.start();
	}
}

对应线程的thread-dump:

"Lock-holder" #21 prio=5 os_prio=0 cpu=15.63ms elapsed=22.07s tid=0x000001df30756800 nid=0x3f7c waiting on condition  [0x0000004b1f5ff000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(java.base@11.0.7/Native Method)
        at java.lang.Thread.sleep(java.base@11.0.7/Thread.java:339)
        at java.util.concurrent.TimeUnit.sleep(java.base@11.0.7/TimeUnit.java:446)
        at org.littlestar.helper.Test3.lambda$0(Test3.java:18)
        at org.littlestar.helper.Test3$$Lambda$1/0x0000000800060840.run(Unknown Source)
        at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)

   Locked ownable synchronizers:
        - <0x000000071b8767c0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)  -->持有锁0x000000071b8767c0, 类型是NonfairSync,继承AQS

"Lock-waiter" #22 prio=5 os_prio=0 cpu=0.00ms elapsed=22.07s tid=0x000001df30755000 nid=0x1b10 waiting on condition  [0x0000004b1f6fe000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
        - parking to wait for  <0x000000071b8767c0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)  -->等待获取锁0x000000071b8767c0
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.7/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.7/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.7/ReentrantLock.java:267)
        at org.littlestar.helper.Test3.lambda$1(Test3.java:27)
        at org.littlestar.helper.Test3$$Lambda$2/0x0000000800060c40.run(Unknown Source)
        at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)

   Locked ownable synchronizers:
        - None