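We run a Java web service on Tomcat in production, with nginx in front doing round-robin load balancing. One problem kept recurring: one of the Tomcat instances would hang for no apparent reason, and requests sent directly to its IP never got a response.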

1. Check disk, memory, and other basics:

The basic server metrics showed nothing abnormal.

2. Check connection counts:

$ netstat -natp | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
11 LISTEN
19 TIME_WAIT
106 CLOSE_WAIT
142 ESTABLISHED

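There were quite a few CLOSE_WAIT connections, though not many compared with a heavily loaded server. (The `1 established)` and `1 Foreign` rows are just the two netstat header lines.) The other, healthy instances had essentially zero CLOSE_WAIT, so I spent a lot of effort digging here, tuning all sorts of Linux kernel parameters...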
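The state count above can be tightened so that the netstat header lines are not counted as states. A minimal sketch, assuming Linux net-tools `netstat` output with the state in the sixth column; the function name `count_tcp_states` is made up for illustration:

```shell
# count_tcp_states: read `netstat -nat` output on stdin and print
# "<count> <state>" for each TCP state, smallest count first.
# Only rows whose first column starts with "tcp" are counted,
# which skips the two header lines.
count_tcp_states() {
  awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -n
}

# Usage (not run here): netstat -nat | count_tcp_states
```

Reading from stdin keeps the function usable on saved netstat output as well as on a live host.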

3. Check GC:

$ jstat -gcutil 3494 1000 1000
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 0.00 2.87 1.28 100.00 31420 140.865 31399 4345.929 4486.793
0.00 0.00 2.94 1.28 100.00 31420 140.865 31399 4345.929 4486.793
0.00 0.00 2.94 1.28 100.00 31420 140.865 31399 4345.929 4486.793

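That was the problem: PermGen (the P column) is 100% full. The full GC count (FGC, 31399) is almost as high as the young GC count (YGC, 31420), and more than 4,300 seconds have gone into full GCs. The log confirms it: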

ERROR] [org.springframework.boot.web.support.ErrorPageFilter:176] Forwarding to error page from request [/algoAbTest/algoAbTestIndex] due to exception [Cannot deserialize; nested exception is org.springframework.core.serializer.support.SerializationFailedException: Failed to deserialize payload. Is the byte array a result of corresponding serialization for DefaultDeserializer?; nested exception is java.lang.OutOfMemoryError: PermGen space]
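The jstat reading that exposed the problem can be turned into a quick check. The sketch below assumes the Java 7 `jstat -gcutil` layout shown earlier, where the header names a `P` (PermGen) column and an `FGC` column; on Java 8 and later the column set differs, so it would need adapting. The function name `check_permgen` and the default threshold of 95 are illustrative choices, not part of any tool:

```shell
# check_permgen: read `jstat -gcutil` output on stdin and print a
# warning for every sample whose PermGen usage is at or above the
# threshold (first argument, percent, default 95).
check_permgen() {
  awk -v limit="${1:-95}" '
    NR == 1 {
      # Locate the columns by header name instead of by fixed position.
      for (i = 1; i <= NF; i++) {
        if ($i == "P")   p = i
        if ($i == "FGC") fgc = i
      }
      next
    }
    p && ($p + 0) >= limit {
      printf "PermGen %s%% used, %s full GCs so far\n", $p, $fgc
    }
  '
}

# Usage (not run here): jstat -gcutil 3494 1000 5 | check_permgen 95
```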

Now that the problem is known, the next step is to find its cause.

1) Check the JVM state:

$ jmap -heap 3494
Attaching to process ID 3494, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.79-b02

using thread-local object allocation.
Parallel GC with 8 thread(s)

Heap Configuration:
MinHeapFreeRatio = 0
MaxHeapFreeRatio = 100
MaxHeapSize = 8417968128 (8028.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 17592186044415 MB
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 85983232 (82.0MB)
G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
capacity = 2804940800 (2675.0MB)
used = 0 (0.0MB)
free = 2804940800 (2675.0MB)
0.0% used
From Space:
capacity = 524288 (0.5MB)
used = 0 (0.0MB)
free = 524288 (0.5MB)
0.0% used
To Space:
capacity = 524288 (0.5MB)
used = 0 (0.0MB)
free = 524288 (0.5MB)
0.0% used
PS Old Generation
capacity = 5611978752 (5352.0MB)
used = 71781144 (68.4558334350586MB)
free = 5540197608 (5283.544166564941MB)
1.2790701314472832% used
PS Perm Generation
capacity = 85983232 (82.0MB)
used = 85983232 (82.0MB)
free = 0 (0.0MB)
100.0% used

32283 interned Strings occupying 3838824 bytes.

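As the output shows, PermGen is completely full: all 82 MB of its capacity is in use, and MaxPermSize is only 82 MB (no value had been configured, so this appears to be the JVM default). Raising the limit in Tomcat's JVM options is enough.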
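When the jmap output is long, the PermGen figures can be pulled out directly. This is a rough sketch that assumes the Java 7 `jmap -heap` layout shown above, with a "Perm Generation" heading followed by `capacity = ...` and `used = ...` lines in bytes; `perm_usage` is a hypothetical helper name:

```shell
# perm_usage: read `jmap -heap` output on stdin and report how much
# of the permanent generation is in use.
perm_usage() {
  awk '
    /Perm Generation/ { in_perm = 1; next }
    in_perm && $1 == "capacity" { cap = $3 }
    in_perm && $1 == "used"     { used = $3; in_perm = 0 }
    END {
      if (cap + 0 > 0)
        printf "PermGen: %s of %s bytes used (%.1f%%)\n", used, cap, used * 100 / cap
    }
  '
}

# Usage (not run here): jmap -heap 3494 | perm_usage
```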

2) Fix:

Add the following to catalina.sh:

CATALINA_OPTS="$CATALINA_OPTS -Xms20480m -Xmx20480m -Xss2m -XX:PermSize=512M -XX:MaxNewSize=512m -XX:MaxPermSize=512m -XX:NewRatio=2"
CATALINA_OPTS="$CATALINA_OPTS -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Xloggc:/data/logs/tomcat/gc.log"
JMX_REMOTE="-Dcom.sun.management.jmxremote.port=1999 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
CATALINA_OPTS="$CATALINA_OPTS $JMX_REMOTE"
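As an alternative to editing catalina.sh itself, Tomcat sources a `bin/setenv.sh` file at startup if one exists, which keeps local settings out of the stock script. The fragment below is only a sketch of such a file: it repeats the PermGen sizes used above and adds a heap dump on out-of-memory errors, with the dump path being an assumption that reuses the GC log directory. The PermGen flags apply to Java 7 and earlier (the JVM here is 24.79, i.e. Java 7); Java 8 replaced PermGen with Metaspace.

```shell
# Hypothetical $CATALINA_BASE/bin/setenv.sh, sourced by catalina.sh on startup.

# Raise the PermGen limit; sizes mirror the values chosen above.
CATALINA_OPTS="$CATALINA_OPTS -XX:PermSize=512m -XX:MaxPermSize=512m"

# Ask the JVM to write a heap dump when it runs out of memory, so the
# next occurrence can be analysed offline. The path is an assumption.
CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/logs/tomcat"

export CATALINA_OPTS
```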

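3) Further analysis:

The CLOSE_WAIT connections seen earlier were a symptom, not the cause. Because the hung service could not handle requests normally, it never closed its own end of the connections after the peer had closed theirs, so they stayed in CLOSE_WAIT.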
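One way to check this on a live host is to group the CLOSE_WAIT sockets by owning process: if nearly all of them belong to the hung Tomcat's PID, the kernel parameters are not at fault. A sketch, assuming `netstat -natp` output (run with enough privileges to see the PID/Program column, which is the seventh field); `close_wait_by_process` is a made-up name:

```shell
# close_wait_by_process: read `netstat -natp` output on stdin and print
# "<count> <pid/program>" for sockets in CLOSE_WAIT, largest count first.
close_wait_by_process() {
  awk '$1 ~ /^tcp/ && $6 == "CLOSE_WAIT" { n[$7]++ }
       END { for (p in n) print n[p], p }' | sort -rn
}

# Usage (not run here): netstat -natp | close_wait_by_process
```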