Java的thread dump是当前JVM中所有线程状态的一个快照, 包含每个线程当前状态和调用堆栈(stack trace), thread dump对于诊断Java线程相关的问题非常有用。收集thread dump的方法有很多, 可以通过jvm自带的命令行工具jps + jstack, 其它第三方JVM诊断工具通常包含生成thread dump的功能,这里不赘述, 可以自行google/bing/baidu/… 下面主要基于我自己的一个案例来聊聊我对thread dump的理解和分析方法。
起因是我自己写的一个程序, 启动的线程没有响应(一个SwingWorker线程和一个Quartz调度), 卡在什么地方没有响应。
遇到这类线程相关的问题,我们最想知道的是这些线程到底在干啥?被卡在什么地方了,这时候就需要通过生成thread dump,来对线程进行分析。
分析过程
生成thread dump,把对应的线程找出来(后面我再聊聊我找对应线程的一些心得):
"Quartz-Job-Scheduler_Worker-1" #47 ...
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
- parking to wait for <0x000000070d4b4768> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.7/AbstractQueuedSynchronizer.java:2081)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:33)
- locked <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
at org.littlestar.monitor.task.linux.LogFileMonitor.execute(LogFileMonitor.java:110)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
- locked <0x000000070d4b4738> (a java.lang.Object)
"Quartz-Job-Scheduler_Worker-2" #48 ...
java.lang.Thread.State: BLOCKED (on object monitor)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32)
- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
at org.littlestar.monitor.task.linux.LogFileMonitor.execute(LogFileMonitor.java:110)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
- locked <0x000000070d4bbf28> (a java.lang.Object)
"SwingWorker-pool-2-thread-1" #80 ...
java.lang.Thread.State: BLOCKED (on object monitor)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32)
- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
at org.littlestar.monitor.task.linux.LinuxFileSystemViewer.executeInBackground(LinuxFileSystemViewer.java:91)
at org.littlestar.monitor.task.ToolkitExt$BackgroundWorker.doInBackground(ToolkitExt.java:119)
at javax.swing.SwingWorker$1.call(java.desktop@11.0.7/SwingWorker.java:304)
at java.util.concurrent.FutureTask.run(java.base@11.0.7/FutureTask.java:264)
at javax.swing.SwingWorker.run(java.desktop@11.0.7/SwingWorker.java:343)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
Locked ownable synchronizers:
- <0x000000070d4cab60> (a java.util.concurrent.ThreadPoolExecutor$Worker)
从thread dump可以看出"SwingWorker-pool-2-thread-1"和Quartz-Job-Scheduler_Worker-2" 线程都处于阻塞状态( java.lang.Thread.State: BLOCKED),被阻塞的原因是无法获得需要的锁(waiting to lock …)。这两个线程在执行JschConnectorEntity.getConnection方法时因为无法获得锁(0x000000070d4a5a80)的锁而进入阻塞状态, 让出CPU执行:
"SwingWorker-pool-2-thread-1" #80 ...
java.lang.Thread.State: BLOCKED (on object monitor)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32) -->
- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity) --> 等待锁0x000000070d4a5a80
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
...
"Quartz-Job-Scheduler_Worker-2" #48 ...
java.lang.Thread.State: BLOCKED (on object monitor)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:32) -->
- waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity) --> 等待锁0x000000070d4a5a80
...
waiting to lock <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity) 说明等待的锁的地址是0x000000070d4a5a80,锁是一个org.littlestar.monitor.entity.JschConnectorEntity对象, 这是一个对象锁(synchronized),这个锁被谁持有呢?我们通过锁地址0x000000070d4a5a80进行查找可以看到持有该锁定的是"Quartz-Job-Scheduler_Worker-1"线程:
"Quartz-Job-Scheduler_Worker-1" #47...
...
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:33) -->
- locked <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity) --> 持有0x000000070d4a5a80
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:1)
at org.littlestar.monitor.task.Task.getConnection(Task.java:246)
...
“Quartz-Job-Scheduler_Worker-1” 正在执行JschConnectorEntity.getConnection()方法, 并持有0x000000070d4a5a80锁, 看代码(JschConnectorEntity.java:33)可以看到JschConnectorEntity.getConnection()是被synchronized关键字修饰的方法, 想要执行getConnection()方法, 首先需要获取getConnection()方法的对象的对象锁:
public class JschConnectorEntity extends ConnectorEntity<Session> {
...
@Override
public synchronized Session getConnection() throws Exception {
...
}
...
}
“Quartz-Job-Scheduler_Worker-1” 进入了同步代码块(synchornized block), 持有JschConnectorEntity对象的对象锁, 因为对象锁是互斥的,所以 “Quartz-Job-Scheduler_Worker-1” 和"Quartz-Job-Scheduler_Worker-1"无法获得锁,执行getConnection()被阻塞这是很正常的,Java线程同步机制就是这么设计的。问题在于"Quartz-Job-Scheduler_Worker-1"为什么一直持有0x000000070d4a5a80锁, 不释放呢?继续分析Quartz-Job-Scheduler_Worker-1线程:
"Quartz-Job-Scheduler_Worker-1" #47 ...
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
- parking to wait for <0x000000070d4b4768> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.7/AbstractQueuedSynchronizer.java:2081)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
at org.littlestar.monitor.entity.JschConnectorEntity.getConnection(JschConnectorEntity.java:33)
- locked <0x000000070d4a5a80> (a org.littlestar.monitor.entity.JschConnectorEntity)
可以看出Quartz-Job-Scheduler_Worker-1进入getConnection后,调用borrowObject方法,而borrowObject又调用了takeFirst, 该线程是在takeFirst方法调用了ConditionObject.await()后开始进入等待(WAITING)状态,Condition使用的是LockSupport实现的线程等待和唤醒,而非传统的Object.wait和Object.notify。通常,我们只需要看到LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)源代码就可以知道Quartz-Job-Scheduler_Worker-1线程到底在等什么,而再往上的代码其实是await()的底层实现:
class LinkedBlockingDeque<E> extends AbstractQueue<E>
implements Deque<E>, Serializable {
...
/** Condition for waiting takes */
private final Condition notEmpty;
...
lock = new InterruptibleReentrantLock(fairness);
notEmpty = lock.newCondition();
...
/**
* Unlinks the first element in the queue, waiting until there is an element
* to unlink if the queue is empty.
*
* @return the unlinked element
* @throws InterruptedException if the current thread is interrupted
*/
public E takeFirst() throws InterruptedException {
lock.lock();
try {
E x;
while ( (x = unlinkFirst()) == null) {
// --> 等待其它线程执行notEmpty.signal()结束该线程的等待状态;
notEmpty.await(); //--> LinkedBlockingDeque.java:1323
}
return x;
} finally {
lock.unlock();
}
}
}
结合所有的代码一起,诊断的逻辑大概是这样的:
1). 线程进入了同步方法JschConnectorEntity.getConnection(JschConnectorEntity.java:33), 持有了对象锁
2). getConnection使用apache commons pool2的(GenericObjectPool)实现的一个连接池JSchSessionPool;
3). 这里你不需要关心JSchSessionPool的细节, 这是我自己的代码, 你可以把他看成一个GenericObjectPool(的实现类);
4). getConnection调用GenericObjectPool.borrowObject()从线程池中获取一个连接;
public class JschConnectorEntity extends ConnectorEntity<Session> {
@Override
public synchronized Session getConnection() throws Exception {
final JSchSessionPool pool = getSessionPool();
Session session = pool.borrowObject();
return session;
}
}
5). 继续看GenericObjectPool.borrowObject()方法;
6). 通过分析borrowObject()代码可知,当线程池没有空闲对象时,borrowObject()方法会进入等待, borrowObject() -> takeFirst() -> notEmpty.await();
public class GenericObjectPool<T> extends BaseGenericObjectPool<T>
implements ObjectPool<T>, GenericObjectPoolMXBean, UsageTracking<T> {
...
private final LinkedBlockingDeque<PooledObject<T>> idleObjects;
idleObjects = new LinkedBlockingDeque<>(config.getFairness());
@Override
public T borrowObject() throws Exception {
return borrowObject(getMaxWaitDuration());
}
// 因为是第3方库, 不是自己的代码, 要了解意图, 最快捷的方法是阅读代码注释;
/**
* ...
* <p>
* If the pool is exhausted (no available idle instances and no capacity to
* create new ones), this method will either block (if
* {@link #getBlockWhenExhausted()} is true) or throw a
* {@code NoSuchElementException} (if
* {@link #getBlockWhenExhausted()} is false). The length of time that this
* method will block when {@link #getBlockWhenExhausted()} is true is
* determined by the value passed in to the {@code borrowMaxWaitMillis}
* parameter.
* </p>
* <p>
* When the pool is exhausted, multiple calling threads may be
* simultaneously blocked waiting for instances to become available. A
* "fairness" algorithm has been implemented to ensure that threads receive
* available instances in request arrival order.
* </p>
*/
public T borrowObject(final Duration borrowMaxWaitDuration) throws Exception {
...
p = idleObjects.takeFirst();
...
}
}
7). 知道了等待原因是连接池没有空闲连接可用,造成的原因要么就是线程池配置太小了; 要么就是从线程池借了(borrowObject)线程,但没有将线程(returnObject)归还给线程池;
8). 通常连接池设置再小,只要使用完后返回连接池, 不会等待如此长时间, 重点放在排查连接归还问题,搜索执行了getConnection的代码,检查并补上缺失的归还连接池代码后问题解决。
一些个人心得
- 一个线程处于等待(WAITING) 或者被阻塞(BLOCKED )都是线程执行过程中正常的状态, 除了检测到死锁, 对于一个thread dump, 不能简单的通过线程状态来判断一个线程是否有问题, thread dump是一个瞬时状态(快照), 一个线程你执行thread dump时是阻塞的, 下一秒可能就因为满足条件被唤醒继续执行了, 所以单靠线程状态说明任何问题。一个线程线程长期处于BLOCKED状态是值得怀疑的,因为thread dump是瞬时值,你通常需要收集多个thread dump来进行比对验证, 一些第三方工具更直观, 特别是你可以进行线上诊断时, 例如开源的VisualVM:
上面标蓝色的线程,通常表示处于BLOCKED状态的线程,黄色和橙色则通常表示处于WAITING状态的线程。一个线程长时间处于BLOCKED状态,通常是因为长时间无法获得锁,如上图Quartz-Job-Scheduler_Worker-2和SwingWorker-pool-2-thread-1线程,都是一条纯蓝色线条,说明从VisualVM监控开始,一直都是BLOCKED状态,从后面的Runing的时间0ms也可以看出。另外,VisualVM的插件"Threads Inspector"也非常推荐,你只要勾选你关心的线程,就能生成对应线程的call stack,非常好用。
对于thread dump分析,因为跟源代码密切相关,通常要有源代码,一些开源的库都会提供源代码包(*-source.jar)。thread dump需要对要分析的软件或者使用的库有比较深入的了解, 有时候你通过故障现象,就能猜到到大概是什么线程或者方法有问题,分析thread dump本身没太多通用的技巧,跟多是对代码的逻辑进行分析,通常软件的开发者在分析上更有先天的优势。
一个线程的stack trace都会有对应源代码的位置, 例如:
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223) --> GenericObjectPool.java的第223行。
我们通常使用一些工具来阅读源代码以提高效率, 如Eclipse,把需要jar文件和源代码文件jar文件都放到一个项目内,通过Eclipse的功能,就非常方便的对源代码进行阅读, 以Eclipse为例,在java文件写入: org.apache.commons.pool2.impl.GenericObjectPool, 选择该类/方法/变量/… 右键选"Open Declaration", 就能够导航到对应的类和方法的源代码定义。
- 堆栈信息(stack trace)调用的顺序是从下到上的, 也就是线程从堆栈最底的方法开始的调用, 一直到stack trace最顶部的方法;
- stack trace最底部不代表线程开始执行的位置, thread dump是看不出线程是怎么启动的(启动线程的代码位置), 因为执行线程启动方法是不会阻塞的, 代码会继续往下执行,创建线程的方法通常执行结束就出栈了。观察所有线程的stack trace, 最底部执行的都是run()方法,这是因为java创建的线程,无论是通过Thread, 还是其它什么线程池的方式, 线程执行开始的地方通常都是Runnable接口的run()方法。
Runnable.run() - When an object implementing interface Runnable is used to create a thread, starting the thread causes
the object's run method to be called in that separately executing thread.
例如: 上面的SwingWorker-pool-2-thread-1, 线程是在一个Swing菜单事件响应方法中启动的:
SwingWorker<Object, Void> backgroundWorker;
...
backgroundWorker.execute();
从stack trace最底部开始, 到调用你写的代码之前, 实际上都是SwingWorker内部的调用, 通常我们不需要关心这部分内容, 我们分析一个线程调用堆栈,只需要从底部的第一条自己编写的代码(非JVM的代码)开始。当然你有兴趣从SwingWorker.execute()方法一直跟踪下去,也能走出下列调用:
...
at javax.swing.SwingWorker$1.call(java.desktop@11.0.7/SwingWorker.java:304)
at java.util.concurrent.FutureTask.run(java.base@11.0.7/FutureTask.java:264)
at javax.swing.SwingWorker.run(java.desktop@11.0.7/SwingWorker.java:343)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
- 线程问题的诊断,特别是多线程,通常逃不掉锁的问题,thread dump包含了线程持有锁和等待锁的信息,根据锁类型不同,在thread dump对锁的信息的显示也稍有不同, 主要有两类:
(1). 一个是传统的对象锁(synchronized): locked … /waiting to lock …
进入了同步代码块, 就说明持有了该对象锁, 如:前面提到的synchronized方法, 或者一些同步块,如:
Object lock = new Object();
synchronized(lock) {
//...
}
这类锁定的持有线程,都会在stack trace对应的位置标识锁信息: "locked <锁定地址> (锁对象的类) "
当线程尝试进入同步代码块,但是无法获取锁,线程stack trace对应代码位置将标识: waiting to lock <锁定地址> (锁对象的类)
(2). 而另一个是线程持有同步器Synchronizer锁(AbstractOwnableSynchronizer的子类, 通常是AQS):
这类锁持有信息会在线程的调用堆栈的下面显示,如, 上例"SwingWorker-pool-2-thread-1"线程
Locked ownable synchronizers:
- <0x000000070f5049a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
锁类型是ThreadPoolExecutor的内部类Worker, Worker类继承AQS(AbstractQueuedSynchronizer),而AQS则继承自AbstractOwnableSynchronizer:
private final class Worker
extends AbstractQueuedSynchronizer ----> extends AbstractOwnableSynchronizer
implements Runnable {
...
}
加锁定是在ThreadPoolExecutor.runWorker方法内, 没有对象锁定这么直观一眼就能找到位置:
final void runWorker(Worker w) {
...
w.lock();
...
}
对于这类锁定, 我们更常用的是可重入锁(ReentrantLock), ReentrantLock内部使用使用了AQS实现lock/unlock。举个简单例子:
class Test3 {
public static void main(String[] args) throws Exception {
ReentrantLock lock = new ReentrantLock();
一个线程持有锁定后进入等待
Thread thread1 = new Thread(() -> {
lock.lock();
try {
TimeUnit.MINUTES.sleep(4);
} catch (InterruptedException e) {
} finally {
lock.unlock();
}
});
thread1.setName("Lock-holder");
thread1.start();
一个线程同时尝试获取锁定
Thread thread2 = new Thread(() -> {
lock.lock();
System.out.println("Got lock~");
});
thread2.setName("Lock-waiter");
thread2.start();
}
}
对应线程的thread-dump:
"Lock-holder" #21 prio=5 os_prio=0 cpu=15.63ms elapsed=22.07s tid=0x000001df30756800 nid=0x3f7c waiting on condition [0x0000004b1f5ff000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@11.0.7/Native Method)
at java.lang.Thread.sleep(java.base@11.0.7/Thread.java:339)
at java.util.concurrent.TimeUnit.sleep(java.base@11.0.7/TimeUnit.java:446)
at org.littlestar.helper.Test3.lambda$0(Test3.java:18)
at org.littlestar.helper.Test3$$Lambda$1/0x0000000800060840.run(Unknown Source)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
Locked ownable synchronizers:
- <0x000000071b8767c0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) -->持有锁0x000000071b8767c0, 类型是NonfairSync,继承AQS
"Lock-waiter" #22 prio=5 os_prio=0 cpu=0.00ms elapsed=22.07s tid=0x000001df30755000 nid=0x1b10 waiting on condition [0x0000004b1f6fe000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
- parking to wait for <0x000000071b8767c0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) -->等待获取锁0x000000071b8767c0
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.7/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.7/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.7/ReentrantLock.java:267)
at org.littlestar.helper.Test3.lambda$1(Test3.java:27)
at org.littlestar.helper.Test3$$Lambda$2/0x0000000800060c40.run(Unknown Source)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
Locked ownable synchronizers:
- None