事故现象

 

线上服务不定点的不能使用, 查日志发现resin服务正在重启。 由于resin所在的服务有web监控,如果web服务没有相应,则会重启resin服务。 那到底是什么导致了无法请求到web服务呢(60秒没有相应)

 

故障定位
 
1, 首先查看相应的日志:
/data/logs/chewenwww/stdout.log   正常信息日志
/data/logs/chewenwww/stderr.log   异常信息日志
/data/server/resin/log/jvm-default.log  resin的日志
查看resin的日志发现在车resin重启之前有如下日志:
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2230)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:513)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:879)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:837)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:848)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:870)
        at jp.gummy.service.cache.MemCacheManager.get(MemCacheManager.java:88)
        at jp.gummy.tools.common.cache.MCache.get(MCache.java:304)

查看stdout.log 也发现有 zookeeper 重新链接的日志
[03 04:17:02,860 INFO ] [main-SendThread(10.3.63.205:2181)] zookeeper.ClientCnxn - Client session timed out, 
have not heard from server in 21688ms for sessionid 0x2441aa450100345, closing socket connection and attempting reconnect

就此认为是 网络的问题。

后经过排查, 发现网络也没有什么大的问题。

2, 查看慢查询日志

经过查询慢查询日志, 发现有一条根据手机号码查询user对象的语句
    @SQL("select "+ALL_FIELD+" from " + TABLE_NAME
            + " where phoneNum = :1 ")
    public CwUser getCWUserByPhoneNum(String phoneNum);
经过客户端查询分析, 此语句查询很慢。

这个时候以为大概就找到原因了, 可不太明白为什么一条慢查询怎么就导致了服务的变慢,而且查询
日志发现这条语句的执行频率也不是太高。 
故障重现分析
 
1, 将线上的206web服务拿下来,模拟用户点击使用手机号登录, 模拟慢查询的操作。 点击两三次以后发现, 206的web服务不在相应。
2, 增加gc的日志记录, 在/data/server/resin/conf/resin.conf 配置文件中增加gc的日志。如下
      <jvm-arg>-Xms4096m</jvm-arg>
      <jvm-arg>-Xmx4096m</jvm-arg>
      <jvm-arg>-Xdebug</jvm-arg>
      <jvm-arg>-Xss512k</jvm-arg>
      <jvm-arg>-XX:PermSize=256M</jvm-arg>
      <jvm-arg>-XX:MaxPermSize=512m</jvm-arg>
      <jvm-arg>-XX:+PrintGCDetails</jvm-arg>
      <jvm-arg>-XX:+PrintGCTimeStamps</jvm-arg>
      <jvm-arg>-Xloggc:./log/gc.log</jvm-arg>   

3, 查看gc的log发现,在点击完登录以后(手机号为:abc, 密码为:123456)。jvm模型中的年老代持续增加。
   gc日志路径 /data/server/resin/log/gc.log
310.815: [GC [PSYoungGen: 838342K->12272K(1100672K)] 1214764K->406864K(3896896K), 0.0308350 secs] [Times: user=0.09 sys=0.02, real=0.03 secs] 
23268.488: [GC [PSYoungGen: 821872K->16666K(1119360K)] 1216464K->412505K(3915584K), 0.0302600 secs] [Times: user=0.08 sys=0.00, real=0.03 secs] 
24054.566: [GC [PSYoungGen: 849818K->18987K(1111872K)] 1245657K->420547K(3908096K), 0.0275740 secs] [Times: user=0.14 sys=0.01, real=0.03 secs] 
24251.516: [GC [PSYoungGen: 852139K->269605K(735744K)] 1253699K->1161827K(3531968K), 0.4672150 secs] [Times: user=4.31 sys=0.35, real=0.47 secs] 
24254.009: [GC [PSYoungGen: 735717K->460824K(932096K)] 1627939K->1618868K(3728320K), 0.4211430 secs] [Times: user=4.06 sys=0.15, real=0.42 secs] 
24255.917: [GC [PSYoungGen: 926936K->463479K(932096K)] 2084980K->2083985K(3728320K), 0.5223580 secs] [Times: user=4.91 sys=0.28, real=0.52 secs] 
24258.026: [GC [PSYoungGen: 929591K->465319K(932096K)] 2550097K->2549705K(3728320K), 0.5334350 secs] [Times: user=4.93 sys=0.35, real=0.53 secs] 
24260.064: [GC [PSYoungGen: 931431K->465972K(932096K)] 3015817K->3015163K(3728320K), 0.5280580 secs] [Times: user=4.89 sys=0.36, real=0.53 secs] 
24260.592: [Full GC [PSYoungGen: 465972K->217778K(932096K)] [PSOldGen: 2549191K->2796224K(2796224K)] 3015163K->3014002K(3728320K) [PSPermGen: 127322K->127322K(264832K)], 4.9594160 secs] [Times: user=4.88 sys=0.08, real=4.95 secs] 
24266.419: [Full GC [PSYoungGen: 466112K->466112K(932096K)] [PSOldGen: 2796224K->2796224K(2796224K)] 3262336K->3262336K(3728320K) [PSPermGen: 127322K->127322K(262144K)], 4.8840250 secs] [Times: user=4.87 sys=0.01, real=4.88 secs] 
24271.304: [Full GC [PSYoungGen: 466112K->340903K(932096K)] [PSOldGen: 2796224K->2796223K(2796224K)] 3262336K->3137127K(3728320K) [PSPermGen: 127322K->125606K(262144K)], 6.4299960 secs] [Times: user=6.42 sys=0.01, real=6.43 secs] 
24278.162: [Full GC [PSYoungGen: 466112K->466112K(932096K)] [PSOldGen: 2796223K->2796223K(2796224K)] 3262335K->3262335K(3728320K) [PSPermGen: 125607K->125607K(262144K)], 4.9364420 secs] [Times: user=4.93 sys=0.01, real=4.94 secs] 
24283.109: [Full GC [PSYoungGen: 466112K->466112K(932096K)] [PSOldGen: 2796224K->2796224K(2796224K)] 3262336K->3262336K(3728320K) [PSPermGen: 125607K->125607K(262144K)], 4.9332780 secs] [Times: user=4.93 sys=0.01, real=4.93 secs] 
24288.042: [Full GC [PSYoungGen: 466112K->449133K(932096K)] [PSOldGen: 2796224K->2796223K(2796224K)] 3262336K->3245357K(3728320K) [PSPermGen: 125607K->125606K(262144K)], 6.4916020 secs] [Times: user=6.48 sys=0.01, real=6.49 secs] 
24294.579: [Full GC [PSYoungGen: 466112K->461216K(932096K)] [PSOldGen: 2796223K->2796223K(2796224K)] 3262335K->3257440K(3728320K) [PSPermGen: 125617K->125617K(262144K)], 6.5066020 secs] [Times: user=6.50 sys=0.01, real=6.50 secs] 
24301.100: [Full GC [PSYoungGen: 466112K->466112K(932096K)] [PSOldGen: 2796223K->2796223K(2796224K)] 3262335K->3262335K(3728320K) [PSPermGen: 125617K->125617K(262144K)], 4.9549900 secs] [Times: user=4.94 sys=0.01, real=4.95 secs] 
24306.056: [Full GC [PSYoungGen: 466112K->466112K(932096K)] [PSOldGen: 2796223K->2796223K(2796224K)] 3262335K->3262335K(3728320K) [PSPermGen: 125617K->125617K(262144K)], 4.9524090 secs] [Times: user=4.94 sys=0.01, real=4.95 secs] 
24311.013: [Full GC [PSYoungGen: 466112K->466112K(932096K)] [PSOldGen: 2796223K->2796223K(2796224K)] 3262335K->3262335K(3728320K) [PSPermGen: 125617K->125617K(262144K)], 4.9518770 secs] [Times: user=4.94 sys=0.02, real=4.95 secs] 

发现年老代的占用空间几乎达到 100%, PSOldGen: 2796223K->2796223K(2796224K)]。 这个很让人费解
日志格式说明参考: gc.jpeg 和 fgc.jpeg

4, 再次模拟上述测试环境, 在 stderr.log中发现如下:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1632)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1409)
    at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2886)
    at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:476)
    at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2581)
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1757)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2171)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2568)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2113)
    at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2275)
    at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
    at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
    at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:648)
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:591)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:641)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:670)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:678)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:710)
    at net.paoding.rose.jade.provider.jdbc.JdbcImpl.query(JdbcImpl.java:67)
    at net.paoding.rose.jade.provider.jdbc.JdbcDataAccess.select(JdbcDataAccess.java:79)
    at net.paoding.rose.jade.core.SQLThreadLocalWrapper.select(SQLThreadLocalWrapper.java:48)
    at net.paoding.rose.jade.core.SelectOperation.execute(SelectOperation.java:66)
    at net.paoding.rose.jade.core.JadeDaoInvocationHandler.invoke(JadeDaoInvocationHandler.java:128)
    at $Proxy35.getCWUserByPhoneNum(Unknown Source)
    at com.chewen.chewen_user.biz.impl.CwUserBizImpl.getByPhoneNum(CwUserBizImpl.java:68)

由此猜测是否是查询的结果是很多条记录???????  导致堆栈溢出呢 
故障推论验证
 
1, 通过mysql客户端发现, 查询 
mysql>  select count(*) from user where phoneNum="abc" ;
+----------+
| count(*) |
+----------+
|  5115771 |
+----------+

有大概五百多万条记录,将查询条件更改为其他字符时, 发现结果一样
mysql>  select count(*) from user where phoneNum="abcdef" ;
+----------+
| count(*) |
+----------+
|  5115771 |
+----------+

推测, 难道是mysql查询时将字符串转为Bigint格式的时候转为0了?
mysql>  select count(*) from user where phoneNum=0;
+----------+
| count(*) |
+----------+
|  5115771 |
+----------+



结果一样。

到此, 真相大白了, 因为数据表结构中phoneNum是Bigint类型的, 而查询的时候如果是字符串则将
字符串转化为了Bigint格式, 转为0了。导致了上百万条记录的返回, 导致jvm堆栈溢出。