探寻 Redis 超时元凶的全过程剖析

原创

mb5fed6f751964c 2021-02-03 07:45:15 ©著作权

文章标签 Java 文章分类 Redis 数据库

©著作权归作者所有：来自51CTO博客作者mb5fed6f751964c的原创作品，请联系作者获取转载授权，否则将追究法律责任

一、问题

1.应用端使用了我们提供的一个 redis-sentinel 集群(1主，1从，3个sentinel)

2.客户端设置了超时时间为 200ms, 下面是应用端提供的超时日志。注意上图对象数只有 265 个。

2016-02-03 14:20:42,981 [DubboServerHandler-10.16.xx.xx:20880-thread-51] WARN  com.xx.DramaTabRelatePgcComponentImpl$1 (DataComponentCommand.java:76) - commandKey=drama_tab_pgc groupKey=drama_tab_pgc_pool poolKey=drama_tab_pgc_pool timeout cost=201 ms

...................

2016-02-03 14:20:40,168 [DubboServerHandler-10.16.xx.xx:20880-thread-9] WARN  com.xx.DramaTabRelatePgcComponentImpl$1 (DataComponentCommand.java:76) - commandKey=drama_tab_pgc groupKey=drama_tab_pgc_pool poolKey=drama_tab_pgc_pool timeout cost=200 ms

2016-02-03 09:56:14,174 [DubboServerHandler-10.16.xx.xx:20880-thread-146] WARN  com.xx.DramaTabRelatePgcComponentImpl$1 (DataComponentCommand.java:76) - commandKey=drama_tab_pgc groupKey=drama_tab_pgc_pool poolKey=drama_tab_pgc_pool timeout cost=200 ms

.....................

2016-02-03 12:32:03,575 [DubboServerHandler-10.16.xx.xx:20880-thread-125] WARN  com.xx.DramaTabRelatePgcComponentImpl$1 (DataComponentCommand.java:76) - commandKey=drama_tab_pgc groupKey=drama_tab_pgc_pool poolKey=drama_tab_pgc_pool timeout cost=200

二、逐个排查

1.Redis慢查询：并没有发现慢查询，跳过 2.Redis 日志：对象数只有 265 个，注意从 2016 年 1 月 25 后就没日志了，所以并没有什么异常，也没什么 RDB 和 AOF 重写。跳过

3.机器：tsar 观察 cpu，内存，网络，负载，本地 IO cpu、内存、负载、本地IO比较正常。唯一以前怀疑的是网络，按理说这个流量也不是很大，但是看了一下机房的拓扑关系以及redis-cli的测试就了解原因了，下一节进行简单分析。

三、怀疑并确认

应用端与服务器网络

redis-cli

redis-cli 是个比较好的方法来测 redis 的延迟，为此我们用下面的 api 来测试，发现会出现 200ms 的情况。

具体原因猜测：机房之间的带宽有限，据说只有 50M，这台机器本身流量就有点大而且是台虚机。

redis-cli --latency-history -h ip -p port


[@zw-34-55 ~]# redis-cli --latency-history -h 10.11.132.xx -p 6388

min: 0, max: 1, avg: 0.48 (1290 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.87 (1264 samples) -- 15.00 seconds range

min: 0, max: 1, avg: 0.60 (1275 samples) -- 15.01 seconds range

min: 0, max: 202, avg: 0.69 (1265 samples) -- 15.00 seconds range

min: 0, max: 202, avg: 0.81 (1271 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.79 (1254 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.52 (1283 samples) -- 15.00 seconds range

min: 0, max: 1, avg: 0.50 (1288 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.89 (1260 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.57 (1277 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.52 (1284 samples) -- 15.00 seconds range

min: 0, max: 1, avg: 0.69 (1284 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.76 (1256 samples) -- 15.00 seconds range

min: 0, max: 1, avg: 0.48 (1300 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.55 (1297 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.89 (1257 samples) -- 15.01 seconds range

min: 0, max: 202, avg: 0.68 (1277 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.52 (1296 samples) -- 15.01 seconds range

min: 0, max: 1, avg: 0.74 (1278 samples) -- 15.00 seconds range

min: 0, max: 1, avg: 0.73 (1271 samples) -- 15.00 seconds rang

四、解决和观察

思路：写入端是后台管理系统，流量较小，读端流量较大，为此让master改为电信机房。

解决方法：(1) 添加一个电信的slave (2) 下线老的联通slave (3) 主从做sentinel failover

总结：查询超时的基本思路 + 分配机器要考虑流量的合理性。

参考文献

```
Redis latency problems troubleshooting
```

上一篇：探寻 Redis 内存诡异增长的元凶

下一篇：分布式锁不是控制并发幂等的方式

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯