Arthas 是一款线上监控诊断产品,通过全局视角实时查看应用 load、内存、gc、线程的状态信息,并能在不修改应用代码的情况下,对业务问题进行诊断,包括查看方法调用的出入参、异常,监测方法执行耗时,类加载信息等,大大提升线上问题排查效率。
https://arthas.aliyun.com/doc/
1. 启动arthas
在命令行下面执行(使用和目标进程一致的用户及jdk启动,否则可能attach失败):
java -jar arthas-boot.jar
- 执行该程序的用户需要和目标进程具有相同的权限。比如以
admin
用户来执行: sudo su admin && java -jar arthas-boot.jar 或 sudo -u admin -EH java -jar arthas-boot.jar 。 - 如果attach不上目标进程,可以查看
~/logs/arthas/
目录下的日志。 - java -jar arthas-boot.jar -h 打印更多参数信息。
选择应用java进程:
$ java -jar arthas-boot.jar
[1]: 35542
[2]: 71560 math-game.jar
math-game
进程是第2个,则输入2,再输入 回车/enter
。Arthas会attach到目标进程上,并输出日志。
2. 通过stack命令来获取指定方法被调用的执行路径、堆栈
$ stack demo.MathGame primeFactors
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 36 ms.
ts=2018-12-04 01:32:19;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
@demo.MathGame.run()
at demo.MathGame.main(MathGame.java:16)
据条件表达式来过滤
$ stack demo.MathGame primeFactors 'params[0]<0' -n 2
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 30 ms.
ts=2018-12-04 01:34:27;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
@demo.MathGame.run()
at demo.MathGame.main(MathGame.java:16)
ts=2018-12-04 01:34:30;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
@demo.MathGame.run()
at demo.MathGame.main(MathGame.java:16)
Command execution times exceed limit: 2, so command will exit. You can set it with -n option.
据执行时间来过滤
$ stack demo.MathGame primeFactors '#cost>5'
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 35 ms.
ts=2018-12-04 01:35:58;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
@demo.MathGame.run()
at demo.MathGame.main(MathGame.java:16)
3. watch 能观察到的范围为:返回值、抛出异常、入参,支持 OGNL 表达式进行变量的查看
通过watch命令来查看 demo.MathGame#primeFactors 函数的入参、属性值、返回值, x为深度,b调用前,s返回后,n监控次数
$ watch demo.MathGame primeFactors "{params,target,returnObj}" -x 2 -b -s -n 2
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 46 ms.
ts=2018-12-03 19:29:54; [cost=0.01696ms] result=@ArrayList[
@Object[][
@Integer[1],
],
@MathGame[
random=@Random[java.util.Random@522b408a],
illegalArgumentCount=@Integer[13038],
],
null,
]ts=2018-12-03 19:29:54; [cost=4.277392ms] result=@ArrayList[
@Object[][
@Integer[1],
],
@MathGame[
random=@Random[java.util.Random@522b408a],
illegalArgumentCount=@Integer[13038],
],
@ArrayList[
@Integer[2],
@Integer[2],
@Integer[2],
@Integer[5],
@Integer[5],
@Integer[73],
@Integer[241],
@Integer[439],
],
]
4. trace 方法内部调用路径,并输出方法路径上的每个节点上耗时
$ trace demo.MathGame run
Press Q or Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 28 ms.
`---ts=2019-12-04 00:45:08;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
`---[0.617465ms] demo.MathGame:run()
`---[0.078946ms] demo.MathGame:primeFactors() #24 [throws Exception]
`---ts=2019-12-04 00:45:09;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
`---[1.276874ms] demo.MathGame:run()
`---[0.03752ms] demo.MathGame:primeFactors() #24 [throws Exception]
注意,trace 命令只会 trace 匹配到的函数里的子调用,并不会向下 trace 多层。因为 trace 是代价比较贵的,多层 trace 可能会导致最终要 trace 的类和函数非常多。
可以用正则表匹配路径上的多个类和函数,一定程度上达到多层 trace 的效果。
trace -E com.test.ClassA|org.test.ClassB method1|method2
|method3
[arthas@2236726]$ trace --skipJDKMethod false -E com.example.cloudbus.CloudBusApplication tkkk22|tkkk31|tkkk32|tkkk33
Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 4) cost in 180 ms, listenerId: 1
`---ts=2023-07-14 11:41:00;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@7daf6ecc
`---[17624.395668ms] com.example.cloudbus.CloudBusApplication:tkkk22()
+---[0.053865ms] java.lang.System:currentTimeMillis() #30
+---[min=0.01425ms,max=0.037236ms,total=0.308621ms,count=17] java.util.Random:<init>() #31
+---[min=0.004275ms,max=0.021353ms,total=0.101463ms,count=17] java.util.Random:nextInt() #57
+---[min=1087.430621ms,max=1155.758649ms,total=17622.798142ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk31() #32
| `---[min=1087.425581ms,max=1155.731992ms,total=17622.68804ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk31()
| +---[min=0.058004ms,max=0.170134ms,total=1.453502ms,count=16] java.io.PrintStream:println() #38
| `---[min=1087.314568ms,max=1155.563051ms,total=17621.076907ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk32() #39
| `---[min=1087.309887ms,max=1155.537419ms,total=17620.966863ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk32()
| +---[min=0.015758ms,max=0.056775ms,total=0.4743ms,count=16] java.io.PrintStream:println() #43
| `---[min=1.01807ms,max=9.912311ms,total=17578.455042ms,count=15984] com.example.cloudbus.CloudBusApplication:tkkk33() #46
| `---[min=1.014606ms,max=9.900366ms,total=17483.734923ms,count=15984] com.example.cloudbus.CloudBusApplication:tkkk33()
| `---[min=1.012647ms,max=9.895729ms,total=17417.478916ms,count=15984] java.lang.Thread:sleep() #56
+---[0.008009ms] java.lang.StringBuilder:<init>() #34
+---[min=0.002087ms,max=0.013849ms,total=0.026167ms,count=3] java.lang.StringBuilder:append() #57
+---[0.002482ms] java.lang.System:currentTimeMillis() #57
+---[0.004255ms] java.lang.StringBuilder:toString() #57
`---[0.067332ms] java.io.PrintStream:println() #57
5. monitor 对匹配的类、方法、条件表达式进行监控(非实时)
-c 是统计周期内的执行次数、成功次数、失败次数、平均耗时、失败率, 直到用户输入 Ctrl+C 为止
$ monitor -c 5 demo.MathGame primeFactors
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 94 ms.
timestamp class method total success fail avg-rt(ms) fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:38 demo.MathGame primeFactors 5 1 4 1.15 80.00%
timestamp class method total success fail avg-rt(ms) fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:43 demo.MathGame primeFactors 5 3 2 42.29 40.00%
timestamp class method total success fail avg-rt(ms) fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:48 demo.MathGame primeFactors 5 3 2 67.92 40.00%
timestamp class method total success fail avg-rt(ms) fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:53 demo.MathGame primeFactors 5 2 3 0.25 60.00%
timestamp class method total success fail avg-rt(ms) fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:58 demo.MathGame primeFactors 1 1 0 0.45 0.00%
timestamp class method total success fail avg-rt(ms) fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:07:03 demo.MathGame primeFactors 2 2 0 3182.72 0.00%
6. profiler 通过对应用不断的采样,然后把收集到的采样结果生成火焰图
$ profiler start
Started [cpu] profiling
$ profiler stop --format html
profiler output file: /tmp/test/arthas-output/20230517-111550.html
另外,profiler start 支持--event 参数,常用的就是cpu、alloc、wall,默认为cpu仅统计cpu消耗的采样, 而wall表示挂钟时间,包括running、sleeping、blocked时间,命令示例:
$ profiler start --event wall
Started [wall] profiling
$ profiler stop --format html
7. stop 关闭 Arthas 服务端,所有 Arthas 客户端全部退出。
[arthas@16960]$ stop
Resetting all enhanced classes ...
Affect(class count: 0 , method count: 0) cost in 1 ms, listenerId: 0
Arthas Server is going to shutdown...
[arthas@16960]$ session (dd8ad595-1157-4728-8038-f3c29e46a4a8) is closed because server is going to shutdown.
[root@xxxx]$