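One of our production systems consists of several services that talk to each other over Thrift RPC. Every RPC call is wrapped in Hystrix to prevent cascading ("avalanche") failures. It is a fairly conventional and mature architecture.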

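In the early days the system was stable and had no major incidents. As production traffic grew, we occasionally saw a Hystrix circuit breaker open and then never close. The breaker opened because the RPC service had become unstable, but once the service recovered, the breaker should in theory have closed again by itself.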

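We started by checking memory, GC, and connection counts on the client; everything looked normal. The server side looked normal as well. As a last resort we opened the Hystrix dashboard and saw that the breaker stayed open. Forcing it closed through dynamic configuration did not help either (the official explanation, as we understood it, is that hystrix.command.ttbrain_predict_command.circuitBreaker.forceClosed=true does not take effect once the breaker is already open).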


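On the dashboard, the command with the open breaker showed 100 active threads the whole time, and the Hystrix coreSize was also set to 100. We changed the thread pool size dynamically (hystrix.threadpool.ttbrain_predict_threadpoolKey.coreSize=101), and as soon as it became 101 the breaker closed. When we set it back to 100, the breaker stayed open again.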

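According to the documentation, Hystrix performs these checks before running a command:

  1. Check whether the circuit breaker is open. If it is open, jump to step 8 and run the fallback; if it is closed, continue with the next check.
  2. Check whether the thread pool, queue, or semaphore is full. If it is full, go to the fallback in step 8; otherwise continue with the remaining steps.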

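This explains why the breaker stayed open while the pool size was 100: with all 100 threads busy, every request was rejected at step 2, presumably including the trial requests that the breaker lets through to test whether the service has recovered, so the breaker never saw a success. In other words, the Hystrix threads were never released; the original 100 threads were all hung.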
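
The effect can be reproduced without Hystrix. The following is a minimal sketch, assuming a plain java.util.concurrent.ThreadPoolExecutor with no queue as a stand-in for the Hystrix thread pool; the class name SaturatedPoolDemo, the pool size of 4, and the latch that simulates a response that never arrives are all made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only models the "pool is full" check, not Hystrix itself.
class SaturatedPoolDemo {

    /** Returns true if the pool accepted the task, false if it was rejected. */
    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // in Hystrix this path would lead to the fallback
        }
    }

    /** Fills a pool of the given size with hung tasks, probes it, grows it by one, probes again. */
    static boolean[] run(int size) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);    // stands in for the reply that never comes
        CountDownLatch started = new CountDownLatch(size); // lets us wait until every worker is busy
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                size, size, 1, TimeUnit.MINUTES, new SynchronousQueue<>());

        Runnable hung = () -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // nothing to clean up in this demo
            }
        };
        for (int i = 0; i < size; i++) {
            pool.execute(hung);
        }
        started.await();

        boolean acceptedWhileFull = trySubmit(pool, () -> { });

        // Analogous to changing coreSize from 100 to 101: one free slot appears.
        pool.setMaximumPoolSize(size + 1);
        pool.setCorePoolSize(size + 1);
        boolean acceptedAfterResize = trySubmit(pool, () -> { });

        release.countDown(); // let the hung tasks finish so the JVM can exit
        pool.shutdown();
        return new boolean[] { acceptedWhileFull, acceptedAfterResize };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = run(4);
        System.out.println("accepted while full:   " + result[0]); // false
        System.out.println("accepted after resize: " + result[1]); // true
    }
}
```

As long as the hung tasks hold every worker, each new task is rejected, no matter how healthy the downstream service has become. Growing the pool by one thread is enough for a single request to get through, which matches what we saw when switching between 100 and 101.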

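Next we enabled JMX on the client and connected with VisualVM to look at the threads. The threads of one Hystrix command were permanently in the running state. jstack showed the same picture, with the threads sitting in this frame: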

java.net.SocketInputStream.socketRead0(Native Method)

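Every thread of that Hystrix pool was reported as runnable. Note that jstack reports a thread blocked in a native socket read as runnable, so this state does not mean the threads were doing useful work: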

"hystrix-ttbrain_predict_threadpoolKey-100" daemon prio=10 tid=0x00002ae5f80d0000 nid=0x35e8 runnable [0x00002ae6be29f000]
"hystrix-ttbrain_predict_threadpoolKey-99" daemon prio=10 tid=0x00002ae5d41d4000 nid=0x35e5 runnable [0x00002ae6bdc9b000]
"hystrix-ttbrain_predict_threadpoolKey-98" daemon prio=10 tid=0x00002ae5fc087000 nid=0x35e1 runnable [0x00002ae6bd498000]
"hystrix-ttbrain_predict_threadpoolKey-97" daemon prio=10 tid=0x00002ae5f80c7000 nid=0x35dc runnable [0x00002ae6bca92000]
"hystrix-ttbrain_predict_threadpoolKey-96" daemon prio=10 tid=0x00002ae65c1a5800 nid=0x35d9 runnable [0x00002ae6bc490000]
"hystrix-ttbrain_predict_threadpoolKey-95" daemon prio=10 tid=0x00002ae5d41ca800 nid=0x35d2 runnable [0x00002ae6bb689000]
"hystrix-ttbrain_predict_threadpoolKey-94" daemon prio=10 tid=0x00002ae5fc14f800 nid=0x35d0 runnable [0x00002ae6bb287000]
"hystrix-ttbrain_predict_threadpoolKey-93" daemon prio=10 tid=0x00002ae6080fc800 nid=0x35cf runnable [0x00002ae6bb086000]
"hystrix-ttbrain_predict_threadpoolKey-92" daemon prio=10 tid=0x00002ae5d41c7000 nid=0x35ce runnable [0x00002ae6bae84000]
"hystrix-ttbrain_predict_threadpoolKey-91" daemon prio=10 tid=0x00002ae5f80bb000 nid=0x35c5 runnable [0x00002ae6b9c7b000]
"hystrix-ttbrain_predict_threadpoolKey-90" daemon prio=10 tid=0x00002ae5d41be800 nid=0x35bc runnable [0x00002ae6b9478000]
"hystrix-ttbrain_predict_threadpoolKey-89" daemon prio=10 tid=0x00002ae5fc171000 nid=0x35b1 runnable [0x00002ae6b9075000]
"hystrix-ttbrain_predict_threadpoolKey-88" daemon prio=10 tid=0x00002ae5f80b1800 nid=0x35a9 runnable [0x00002ae6b8671000]
"hystrix-ttbrain_predict_threadpoolKey-87" daemon prio=10 tid=0x00002ae65c19e000 nid=0x35a5 runnable [0x00002ae6b7e6c000]
"hystrix-ttbrain_predict_threadpoolKey-86" daemon prio=10 tid=0x00002ae5d41b7000 nid=0x35a1 runnable [0x00002ae6b7669000]
"hystrix-ttbrain_predict_threadpoolKey-85" daemon prio=10 tid=0x00002ae6080f4000 nid=0x359d runnable [0x00002ae6b6e64000]
"hystrix-ttbrain_predict_threadpoolKey-84" daemon prio=10 tid=0x00002ae5f80a5800 nid=0x34ae runnable [0x00002ae6b5dd1000]
"hystrix-ttbrain_predict_threadpoolKey-83" daemon prio=10 tid=0x00002ae5d41af000 nid=0x34aa runnable [0x00002ae6b55cc000]
"hystrix-ttbrain_predict_threadpoolKey-82" daemon prio=10 tid=0x00002ae5fc095800 nid=0x34a6 runnable [0x00002ae6b4dc9000]
"hystrix-ttbrain_predict_threadpoolKey-81" daemon prio=10 tid=0x00002ae5f809c000 nid=0x34a2 runnable [0x00002ae6b45c5000]
"hystrix-ttbrain_predict_threadpoolKey-80" daemon prio=10 tid=0x00002ae65c196800 nid=0x349e runnable [0x00002ae6b3dc1000]
"hystrix-ttbrain_predict_threadpoolKey-79" daemon prio=10 tid=0x00002ae5d41a7800 nid=0x349a runnable [0x00002ae6b35bd000]
"hystrix-ttbrain_predict_threadpoolKey-78" daemon prio=10 tid=0x00002ae608108000 nid=0x3496 runnable [0x00002ae6b2db9000]
"hystrix-ttbrain_predict_threadpoolKey-77" daemon prio=10 tid=0x00002ae5d41a1000 nid=0x3492 runnable [0x00002ae6b25b5000]
"hystrix-ttbrain_predict_threadpoolKey-76" daemon prio=10 tid=0x00002ae5fc0fe000 nid=0x348e runnable [0x00002ae6b1db1000]
"hystrix-ttbrain_predict_threadpoolKey-75" daemon prio=10 tid=0x00002ae5f8091000 nid=0x348a runnable [0x00002ae6b15ad000]
"hystrix-ttbrain_predict_threadpoolKey-74" daemon prio=10 tid=0x00002ae5d4199800 nid=0x3486 runnable [0x00002ae6b0da9000]
"hystrix-ttbrain_predict_threadpoolKey-73" daemon prio=10 tid=0x00002ae5fc18e000 nid=0x3482 runnable [0x00002ae6b05a4000]
"hystrix-ttbrain_predict_threadpoolKey-72" daemon prio=10 tid=0x00002ae5f8087000 nid=0x347e runnable [0x00002ae6afda1000]
"hystrix-ttbrain_predict_threadpoolKey-71" daemon prio=10 tid=0x00002ae65c1bc000 nid=0x347a runnable [0x00002ae6af59d000]
"hystrix-ttbrain_predict_threadpoolKey-70" daemon prio=10 tid=0x00002ae5d4192000 nid=0x3476 runnable [0x00002ae6aed99000]
"hystrix-ttbrain_predict_threadpoolKey-69" daemon prio=10 tid=0x00002ae5d418b000 nid=0x3472 runnable [0x00002ae6ae594000]
"hystrix-ttbrain_predict_threadpoolKey-68" daemon prio=10 tid=0x00002ae5fc40b000 nid=0x346e runnable [0x00002ae6add91000]
"hystrix-ttbrain_predict_threadpoolKey-67" daemon prio=10 tid=0x00002ae5f807c800 nid=0x346a runnable [0x00002ae6ad58d000]
"hystrix-ttbrain_predict_threadpoolKey-66" daemon prio=10 tid=0x00002ae5d4184800 nid=0x3466 runnable [0x00002ae6acd89000]
"hystrix-ttbrain_predict_threadpoolKey-65" daemon prio=10 tid=0x00002ae5fc156000 nid=0x3462 runnable [0x00002ae6ac585000]
"hystrix-ttbrain_predict_threadpoolKey-64" daemon prio=10 tid=0x00002ae5f8073000 nid=0x345e runnable [0x00002ae6abf82000]
"hystrix-ttbrain_predict_threadpoolKey-63" daemon prio=10 tid=0x00002ae65c1b4800 nid=0x345a runnable [0x00002ae6ab77e000]
"hystrix-ttbrain_predict_threadpoolKey-62" daemon prio=10 tid=0x00002ae5d417c800 nid=0x3456 runnable [0x00002ae6aaf79000]
"hystrix-ttbrain_predict_threadpoolKey-61" daemon prio=10 tid=0x00002ae608109800 nid=0x3452 runnable [0x00002ae6aa776000]
"hystrix-ttbrain_predict_threadpoolKey-60" daemon prio=10 tid=0x00002ae5d4162000 nid=0x344e runnable [0x00002ae6a9f72000]
"hystrix-ttbrain_predict_threadpoolKey-59" daemon prio=10 tid=0x00002ae5fc083800 nid=0x344a runnable [0x00002ae6a976e000]
"hystrix-ttbrain_predict_threadpoolKey-58" daemon prio=10 tid=0x00002ae5f806b800 nid=0x3446 runnable [0x00002ae6a8f6a000]
"hystrix-ttbrain_predict_threadpoolKey-57" daemon prio=10 tid=0x00002ae5d4159000 nid=0x3442 runnable [0x00002ae6a8766000]
"hystrix-ttbrain_predict_threadpoolKey-56" daemon prio=10 tid=0x00002ae5fc167000 nid=0x343e runnable [0x00002ae6a7f62000]
"hystrix-ttbrain_predict_threadpoolKey-55" daemon prio=10 tid=0x00002ae5f8063000 nid=0x343a runnable [0x00002ae6a775d000]
"hystrix-ttbrain_predict_threadpoolKey-54" daemon prio=10 tid=0x00002ae65c1e3000 nid=0x3436 runnable [0x00002ae6a6f5a000]
"hystrix-ttbrain_predict_threadpoolKey-53" daemon prio=10 tid=0x00002ae5d4150800 nid=0x3432 runnable [0x00002ae6a6756000]
"hystrix-ttbrain_predict_threadpoolKey-52" daemon prio=10 tid=0x00002ae6081cc000 nid=0x342e runnable [0x00002ae6a5f52000]
"hystrix-ttbrain_predict_threadpoolKey-51" daemon prio=10 tid=0x00002ae5d4147000 nid=0x342a runnable [0x00002ae6a574e000]
"hystrix-ttbrain_predict_threadpoolKey-50" daemon prio=10 tid=0x00002ae5fc078000 nid=0x3426 runnable [0x00002ae6a4f4a000]
"hystrix-ttbrain_predict_threadpoolKey-49" daemon prio=10 tid=0x00002ae5f805a800 nid=0x3422 runnable [0x00002ae6a4746000]
"hystrix-ttbrain_predict_threadpoolKey-48" daemon prio=10 tid=0x00002ae5d413e800 nid=0x341e runnable [0x00002ae6a3f42000]
"hystrix-ttbrain_predict_threadpoolKey-47" daemon prio=10 tid=0x00002ae5fde57800 nid=0x341a runnable [0x00002ae6a373e000]
"hystrix-ttbrain_predict_threadpoolKey-46" daemon prio=10 tid=0x00002ae5f8053800 nid=0x3416 runnable [0x00002ae6a2f3a000]
"hystrix-ttbrain_predict_threadpoolKey-45" daemon prio=10 tid=0x00002ae65c1f1000 nid=0x3412 runnable [0x00002ae6a2736000]
"hystrix-ttbrain_predict_threadpoolKey-44" daemon prio=10 tid=0x00002ae5d4136000 nid=0x340e runnable [0x00002ae6a1f32000]
"hystrix-ttbrain_predict_threadpoolKey-43" daemon prio=10 tid=0x00002ae60865b800 nid=0x340a runnable [0x00002ae6a172e000]
"hystrix-ttbrain_predict_threadpoolKey-42" daemon prio=10 tid=0x00002ae5d412e800 nid=0x3406 runnable [0x00002ae6a0f2a000]
"hystrix-ttbrain_predict_threadpoolKey-41" daemon prio=10 tid=0x00002ae5fc0bf000 nid=0x3402 runnable [0x00002ae6a0726000]
"hystrix-ttbrain_predict_threadpoolKey-40" daemon prio=10 tid=0x00002ae5f804a800 nid=0x33fe runnable [0x00002ae69ff22000]
"hystrix-ttbrain_predict_threadpoolKey-39" daemon prio=10 tid=0x00002ae5d4126800 nid=0x33f9 runnable [0x00002ae69f51d000]
"hystrix-ttbrain_predict_threadpoolKey-38" daemon prio=10 tid=0x00002ae5fc0be000 nid=0x33f5 runnable [0x00002ae69ed18000]
"hystrix-ttbrain_predict_threadpoolKey-37" daemon prio=10 tid=0x00002ae5f8041800 nid=0x33f1 runnable [0x00002ae69e515000]
"hystrix-ttbrain_predict_threadpoolKey-36" daemon prio=10 tid=0x00002ae65c0f6000 nid=0x33ed runnable [0x00002ae69dd10000]
"hystrix-ttbrain_predict_threadpoolKey-35" daemon prio=10 tid=0x00002ae5d411d000 nid=0x33e9 runnable [0x00002ae69d50d000]
"hystrix-ttbrain_predict_threadpoolKey-34" daemon prio=10 tid=0x00002ae6081d1000 nid=0x33e5 runnable [0x00002ae69cd08000]
"hystrix-ttbrain_predict_threadpoolKey-33" daemon prio=10 tid=0x00002ae5d4115000 nid=0x33e1 runnable [0x00002ae69c505000]
"hystrix-ttbrain_predict_threadpoolKey-32" daemon prio=10 tid=0x00002ae5fc0ba800 nid=0x33dd runnable [0x00002ae69bd00000]
"hystrix-ttbrain_predict_threadpoolKey-31" daemon prio=10 tid=0x00002ae5f803a000 nid=0x33d9 runnable [0x00002ae69b4fd000]
"hystrix-ttbrain_predict_threadpoolKey-30" daemon prio=10 tid=0x00002ae5d410c800 nid=0x33d5 runnable [0x00002ae69acf8000]
"hystrix-ttbrain_predict_threadpoolKey-29" daemon prio=10 tid=0x00002ae5fc18a800 nid=0x33d1 runnable [0x00002ae69a4f4000]
"hystrix-ttbrain_predict_threadpoolKey-28" daemon prio=10 tid=0x00002ae5f8032000 nid=0x33cd runnable [0x00002ae699cf0000]
"hystrix-ttbrain_predict_threadpoolKey-27" daemon prio=10 tid=0x00002ae65c1ae000 nid=0x33c3 runnable [0x00002ae6994ed000]
"hystrix-ttbrain_predict_threadpoolKey-26" daemon prio=10 tid=0x00002ae5d4104000 nid=0x33bf runnable [0x00002ae698ce8000]
"hystrix-ttbrain_predict_threadpoolKey-25" daemon prio=10 tid=0x00002ae6081e3000 nid=0x33bb runnable [0x00002ae6984e5000]
"hystrix-ttbrain_predict_threadpoolKey-24" daemon prio=10 tid=0x00002ae5d40fc000 nid=0x33b1 runnable [0x00002ae697218000]
"hystrix-ttbrain_predict_threadpoolKey-23" daemon prio=10 tid=0x00002ae5fe253000 nid=0x33ad runnable [0x00002ae55cb05000]
"hystrix-ttbrain_predict_threadpoolKey-22" daemon prio=10 tid=0x00002ae5f802a800 nid=0x33a3 runnable [0x00002ae593d16000]
"hystrix-ttbrain_predict_threadpoolKey-21" daemon prio=10 tid=0x00002ae5d40f4000 nid=0x3149 runnable [0x00002ae696c16000]
"hystrix-ttbrain_predict_threadpoolKey-20" daemon prio=10 tid=0x00002ae6081d5000 nid=0x3145 runnable [0x00002ae696411000]
"hystrix-ttbrain_predict_threadpoolKey-19" daemon prio=10 tid=0x00002ae5d40ec000 nid=0x3141 runnable [0x00002ae695c0e000]
"hystrix-ttbrain_predict_threadpoolKey-18" daemon prio=10 tid=0x00002ae5fc042000 nid=0x313d runnable [0x00002ae695409000]
"hystrix-ttbrain_predict_threadpoolKey-17" daemon prio=10 tid=0x00002ae5f8024800 nid=0x3139 runnable [0x00002ae694c06000]
"hystrix-ttbrain_predict_threadpoolKey-16" daemon prio=10 tid=0x00002ae65c062000 nid=0x3135 runnable [0x00002ae694401000]
"hystrix-ttbrain_predict_threadpoolKey-15" daemon prio=10 tid=0x00002ae5d40e4800 nid=0x3131 runnable [0x00002ae693bfe000]
"hystrix-ttbrain_predict_threadpoolKey-14" daemon prio=10 tid=0x00002ae5fc169800 nid=0x312d runnable [0x00002ae6933f9000]
"hystrix-ttbrain_predict_threadpoolKey-13" daemon prio=10 tid=0x00002ae5f801d000 nid=0x3129 runnable [0x00002ae692bf5000]
"hystrix-ttbrain_predict_threadpoolKey-12" daemon prio=10 tid=0x00002ae65c053000 nid=0x3125 runnable [0x00002ae6923f1000]
"hystrix-ttbrain_predict_threadpoolKey-11" daemon prio=10 tid=0x00002ae5d416b800 nid=0x3121 runnable [0x00002ae690293000]
"hystrix-ttbrain_predict_threadpoolKey-10" daemon prio=10 tid=0x00002ae608022800 nid=0x311d runnable [0x00002ae6919ec000]
"hystrix-ttbrain_predict_threadpoolKey-9" daemon prio=10 tid=0x00002ae5d4175800 nid=0x3119 runnable [0x00002ae6911e9000]
"hystrix-ttbrain_predict_threadpoolKey-8" daemon prio=10 tid=0x00002ae5f8014800 nid=0x3113 runnable [0x00002ae6905e2000]
"hystrix-ttbrain_predict_threadpoolKey-7" daemon prio=10 tid=0x00002ae65c049000 nid=0x310e runnable [0x00002ae68f88d000]
"hystrix-ttbrain_predict_threadpoolKey-6" daemon prio=10 tid=0x00002ae5d40d6800 nid=0x3109 runnable [0x00002ae68ee89000]
"hystrix-ttbrain_predict_threadpoolKey-5" daemon prio=10 tid=0x00002ae5fc0e4000 nid=0x3104 runnable [0x00002ae68e484000]
"hystrix-ttbrain_predict_threadpoolKey-4" daemon prio=10 tid=0x00002ae5f8009800 nid=0x30ff runnable [0x00002ae68da7e000]
"hystrix-ttbrain_predict_threadpoolKey-3" daemon prio=10 tid=0x00002ae65c03f000 nid=0x30fa runnable [0x00002ae68d079000]
"hystrix-ttbrain_predict_threadpoolKey-2" daemon prio=10 tid=0x00002ae5d40c3000 nid=0x30ed runnable [0x00002ae5ea05a000]
"hystrix-ttbrain_predict_threadpoolKey-1" daemon prio=10 tid=0x00002ae608657000 nid=0x2eee runnable [0x00002ae600600000]

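With a dump this long it is easy to miscount by eye. Below is a small sketch of a helper that counts the threads of one pool in a given state from jstack header lines like the ones above. The class name ThreadDumpCounter and the regular expression are assumptions made for this illustration; the pattern is only written for header lines in the format shown in this dump.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jstack or Hystrix.
class ThreadDumpCounter {

    // Matches header lines of the form:
    // "name" daemon prio=10 tid=0x... nid=0x... runnable [0x...]
    // group(1) = thread name, group(2) = state text between nid and the optional address.
    private static final Pattern HEADER = Pattern.compile(
            "^\"([^\"]+)\".*\\bnid=0x[0-9a-f]+\\s+(.+?)\\s*(\\[0x[0-9a-f]+\\])?$");

    /** Counts header lines whose thread name starts with namePrefix and whose state equals state. */
    static long count(List<String> dumpLines, String namePrefix, String state) {
        long matches = 0;
        for (String line : dumpLines) {
            Matcher m = HEADER.matcher(line.trim());
            if (m.matches() && m.group(1).startsWith(namePrefix) && m.group(2).equals(state)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "\"hystrix-ttbrain_predict_threadpoolKey-100\" daemon prio=10 tid=0x00002ae5f80d0000 nid=0x35e8 runnable [0x00002ae6be29f000]",
                "\"hystrix-ttbrain_predict_threadpoolKey-99\" daemon prio=10 tid=0x00002ae5d41d4000 nid=0x35e5 runnable [0x00002ae6bdc9b000]");
        System.out.println(count(sample, "hystrix-ttbrain_predict_threadpoolKey-", "runnable"));
    }
}
```

If the count for the pool prefix equals the configured coreSize, the whole pool is occupied.
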
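Searching for java.net.SocketInputStream.socketRead0(Native Method) led us to the cause: the Thrift client had no socket timeout configured, so every thread kept waiting for data from the server. Without a read timeout, a blocking read has no time limit of its own; the thread is released only when data arrives or the connection is closed or fails at the TCP level, which can take a very long time.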

Solution: set a socket timeout on the Thrift client.

TSocket tsocket = new TSocket(address.getHostName(), address.getPort(),soReadTimeOut);

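The third argument is the timeout in milliseconds, which TSocket applies to the underlying socket through setSoTimeout.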
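
To see what the timeout changes, here is a minimal sketch that uses only java.net and no Thrift at all. It assumes a local server that accepts the connection and then never sends a byte, which is roughly how the unstable RPC service looked from the client. The class name ReadTimeoutDemo and the 200 ms value are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; it shows plain socket behaviour, not the Thrift client.
class ReadTimeoutDemo {

    /** Reads from a server that never replies; returns true if the read ended with a timeout. */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort());
             Socket accepted = silentServer.accept()) { // accepted, but nothing is ever written to it

            // Same socket option that the TSocket timeout ends up setting.
            // A value of 0 would mean "wait forever", which is the situation from the incident.
            client.setSoTimeout(soTimeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks until data arrives or the timeout expires
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the calling thread is released instead of hanging
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With the timeout in place, a stuck call fails after a bounded time, the Hystrix thread returns to the pool, and later requests can succeed and close the breaker.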