Preface
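After the Redis MQ distributed rework last time, the orchestrated containers ran stably for over a month. Yesterday a colleague on the ETL side suddenly told me that no parsed logs were being collected anymore.
I quickly got onto the server and found that the receiver container, which handles data ingestion, had died. I tried docker container start [containerid], but a few minutes later the container crashed again.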
Redis connection limit exceeded
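Check the container logs with docker logs [containerid]. The key line: CSRedis.RedisException: ERR max number of clients reached
The log says the number of clients connected to the Redis server exceeded the limit. My first thought: one of the orchestrated containers uses CSRedisCore and instantiates 16 clients, one for each of the 16 Redis DBs, but the Redis server surely isn't that fragile.
I went straight to the official redis.io site to look for more information.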
After the client is initialized, Redis checks if we are already at the limit of the number of clients that it is possible to handle simultaneously (this is configured using the maxclients configuration directive, see the next section of this document for further information). In case it can't accept the current client because the maximum number of clients was already accepted, Redis tries to send an error to the client in order to make it aware of this condition, and closes the connection immediately. The error message will be able to reach the client even if the connection is closed immediately by Redis because the new socket output buffer is usually big enough to contain the error, so the kernel will handle the transmission of the error.
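Roughly: the Redis server's maxclients setting caps the number of client connections. If that limit has already been reached, Redis sends an error message back to the new client and closes the connection immediately.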
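To make the admission rule concrete, here is a toy model in Python. It is a sketch under the assumption that only the number of currently connected clients matters; ToyRedisServer, accept and close are hypothetical names, not Redis APIs, and only the error string is the one Redis actually returns.

```python
from typing import Optional, Set, Tuple

MAX_CLIENTS_ERROR = "ERR max number of clients reached"


class ToyRedisServer:
    """Toy model of the maxclients admission check; not Redis source code."""

    def __init__(self, maxclients: int = 10000) -> None:
        self.maxclients = maxclients      # Redis's default maxclients is 10000
        self.connected: Set[int] = set()  # ids of currently connected clients
        self._next_id = 0

    def accept(self) -> Tuple[Optional[int], Optional[str]]:
        """Return (client_id, None) on success or (None, error) when full."""
        if len(self.connected) >= self.maxclients:
            # Real Redis writes the error into the new socket and closes it
            # at once; here we just report the error and keep no state.
            return None, MAX_CLIENTS_ERROR
        self._next_id += 1
        self.connected.add(self._next_id)
        return self._next_id, None

    def close(self, client_id: int) -> None:
        """A client disconnecting frees one slot."""
        self.connected.discard(client_id)


server = ToyRedisServer(maxclients=3)
ids = [server.accept()[0] for _ in range(3)]
print(server.accept())  # (None, 'ERR max number of clients reached')
server.close(ids[0])
print(server.accept())  # (4, None)
```

Nothing frees a slot until some existing client disconnects, which is why leaked connections keep every newcomer, redis-cli included, locked out.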
I logged into the Redis server right away to check the configuration and confirmed that it was using the default maxclients value of 10000.
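The screenshot on the left shows that a redis-cli session was kicked off as soon as it logged into the server.
That was enough to conclude that the Redis client was being used incorrectly.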
How CSRedisCore was being used
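Reading further, I learned that the redis-cli commands info clients and client list can be run against the Redis server to analyze client connections in detail.
info clients confirmed that there really were 10000 connections at that moment.
client list showed the individual connections.
The official documentation describes the client list output fields as follows: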
- addr: The client address, that is, the client IP and the remote port number it used to connect with the Redis server.
- fd: The client socket file descriptor number.
- name: The client name as set by CLIENT SETNAME.
- age: The number of seconds the connection existed for.
- idle: The number of seconds the connection is idle.
- flags: The kind of client (N means normal client, check the full list of flags).
- omem: The amount of memory used by the client for the output buffer.
- cmd: The last executed command.
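Based on these fields, the Redis server had accepted a large number of client connections from 172.16.1.3 (the failing container's IP address on the Docker bridge network), and the last command each of them had sent was ping (a connectivity test command).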
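The same analysis can be scripted. Below is a minimal Python sketch, assuming the raw text of redis-cli client list has been captured into a string: it splits each line into its key=value fields, then tallies connections per IP and the last command (cmd) per IP. The helper names and the sample lines, including 172.16.1.4 as a second container, are made up for illustration and are not the output captured during the incident.

```python
from collections import Counter
from typing import Dict, List


def parse_client_list(output: str) -> List[Dict[str, str]]:
    """Turn raw CLIENT LIST text into one dict of key=value fields per client."""
    clients = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = {}
        for token in line.split():
            key, _, value = token.partition("=")  # "name=" -> ("name", "=", "")
            fields[key] = value
        clients.append(fields)
    return clients


def ip_of(client: Dict[str, str]) -> str:
    """addr is "ip:port"; drop the port."""
    return client.get("addr", "").rsplit(":", 1)[0]


def connections_by_ip(clients: List[Dict[str, str]]) -> Counter:
    """Count connections per client IP."""
    return Counter(ip_of(c) for c in clients if "addr" in c)


def last_commands(clients: List[Dict[str, str]], ip: str) -> Counter:
    """Tally the last command sent by every connection from one IP."""
    return Counter(c.get("cmd", "") for c in clients if ip_of(c) == ip)


# Hand-written sample in the CLIENT LIST format (hypothetical data).
SAMPLE = """\
id=101 addr=172.16.1.3:41820 fd=9 name= age=35 idle=35 flags=N db=0 omem=0 cmd=ping
id=102 addr=172.16.1.3:41822 fd=10 name= age=35 idle=34 flags=N db=0 omem=0 cmd=ping
id=103 addr=172.16.1.4:50112 fd=11 name= age=12 idle=0 flags=N db=0 omem=0 cmd=rpush
"""

clients = parse_client_list(SAMPLE)
print(connections_by_ip(clients))            # Counter({'172.16.1.3': 2, '172.16.1.4': 1})
print(last_commands(clients, "172.16.1.3"))  # Counter({'ping': 2})
```

On a server in the state described above, the first counter would be dominated by a single IP, which points straight at the container that is leaking connections.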
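The failing container uses CSRedisCore as its Redis client and simply writes each Msg into a Redis list data structure. A related GitHub issue on CSRedisCore gave me a clue.
I realized I had put the CSRedisClient instantiation code in the constructor of a .NET Core API controller. Since a controller is constructed for every request, every request also created a new Redis client, so it was no surprise that the client count eventually hit the limit.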
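Dependency injection has three service lifetimes: Singleton (a single instance for the whole application, injected everywhere), Transient (a new instance every time the service is resolved), and Scoped (one instance per scope, which in ASP.NET Core means one per HTTP request).
For why .NET API controllers behave transiently, see the link.
I immediately moved the CSRedisCore instantiation code into Startup.cs and registered it as a singleton.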
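A toy Python model of why the move matters. ToyContainer, FakeRedisClient and the two controller classes are hypothetical stand-ins, not ASP.NET Core or CSRedisCore types; the sketch assumes, as in ASP.NET Core, that a controller is constructed once per request. In the real Startup.cs the fix amounts to a registration along the lines of services.AddSingleton(new CSRedisClient(connectionString)) in ConfigureServices, where connectionString is a placeholder.

```python
from typing import Any, Callable, Dict, List, Set

Factory = Callable[["ToyContainer"], Any]


class ToyContainer:
    """Hypothetical mini DI container mimicking two ASP.NET Core lifetimes."""

    def __init__(self) -> None:
        self._factories: Dict[str, Factory] = {}
        self._singleton_keys: Set[str] = set()
        self._singletons: Dict[str, Any] = {}

    def add_transient(self, key: str, factory: Factory) -> None:
        self._factories[key] = factory  # new instance on every resolve

    def add_singleton(self, key: str, factory: Factory) -> None:
        self._factories[key] = factory
        self._singleton_keys.add(key)   # built on first resolve, then reused

    def resolve(self, key: str) -> Any:
        if key in self._singleton_keys:
            if key not in self._singletons:
                self._singletons[key] = self._factories[key](self)
            return self._singletons[key]
        return self._factories[key](self)


class FakeRedisClient:
    """Stand-in for CSRedisClient: it only records that it was constructed."""

    def __init__(self, created: List["FakeRedisClient"]) -> None:
        created.append(self)


class ControllerThatNewsUpClient:
    """The original mistake: the client is created in the controller constructor."""

    def __init__(self, created: List[FakeRedisClient]) -> None:
        self.redis = FakeRedisClient(created)


class ControllerWithInjectedClient:
    """The fix: the controller receives the shared client from the container."""

    def __init__(self, redis: FakeRedisClient) -> None:
        self.redis = redis


def clients_created(requests: int, singleton_client: bool) -> int:
    created: List[FakeRedisClient] = []
    container = ToyContainer()
    if singleton_client:
        container.add_singleton("redis", lambda c: FakeRedisClient(created))
        container.add_transient(
            "controller", lambda c: ControllerWithInjectedClient(c.resolve("redis")))
    else:
        container.add_transient(
            "controller", lambda c: ControllerThatNewsUpClient(created))
    for _ in range(requests):
        container.resolve("controller")  # one controller instance per request
    return len(created)


print(clients_created(1000, singleton_client=False))  # 1000
print(clients_created(1000, singleton_client=True))   # 1
```

With a client built inside the controller, 1000 requests build 1000 clients, and with a pooled client each of those brings its own connections; with the singleton, one client serves every request.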
Verifying the fix
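info clients now showed a steady 53 Redis connections.
client list showed that 172.16.1.3 (the previously failing container) held 50 client connections, webapp (the other orchestrated container) held 2, and the redis-cli login held 1.
That raised a question: after the fix, why was the receiver container still holding a steady 50 Redis connections?
After further discussion with the author of CSRedisCore, I confirmed that CSRedisCore has a warm-up (preheat) mechanism that by default preheats 50 connections in its connection pool.
Bingo: both the failure and the puzzle were fully explained.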
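The 53 connections can be reproduced with a toy pool. This is a minimal Python sketch, assuming a pool that opens preheat connections eagerly at construction and grows lazily up to poolsize; ToyConnectionPool and its methods are hypothetical, and the defaults of 50 only mirror what the CSRedisCore author described, not CSRedisCore's actual code. The webapp and redis-cli connections are added by hand.

```python
from typing import List


class ToyConnectionPool:
    """Toy connection pool with eager warm-up ("preheat").

    Models the behaviour described for CSRedisCore (50 connections
    preheated by default); it is not CSRedisCore's implementation.
    """

    def __init__(self, server: List[str], owner: str,
                 poolsize: int = 50, preheat: int = 50) -> None:
        if not 0 <= preheat <= poolsize:
            raise ValueError("preheat must be between 0 and poolsize")
        self._server = server  # stands in for the server's client table
        self._owner = owner
        self._poolsize = poolsize
        self._opened = 0
        # Warm-up: open `preheat` connections before any command is sent.
        self._idle = [self._open() for _ in range(preheat)]

    def _open(self) -> str:
        self._opened += 1
        conn = f"{self._owner}#{self._opened}"
        self._server.append(conn)
        return conn

    def acquire(self) -> str:
        """Reuse an idle connection, or open a new one while under poolsize."""
        if self._idle:
            return self._idle.pop()
        if self._opened < self._poolsize:
            return self._open()
        raise RuntimeError("pool exhausted")  # real pools usually wait instead

    def release(self, conn: str) -> None:
        self._idle.append(conn)


server: List[str] = []
receiver = ToyConnectionPool(server, "receiver")  # the single, shared pool
conn = receiver.acquire()                         # e.g. for one list write
receiver.release(conn)                            # reused; nothing new opened
server.extend(["webapp#1", "webapp#2", "redis-cli#1"])  # the other clients
print(len(server))  # 53
```

The count stays at 50 for the receiver no matter how many requests it serves, because each request borrows a preheated connection and returns it; only a second pool (for example, one per request) would push the total up.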
Summary
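Lessons from this incident for anyone using the CSRedisCore client:
① StackExchange.Redis multiplexes commands over shared connections, so registering it as a singleton comes naturally. CSRedisCore instead uses a connection pool, and in high-concurrency scenarios it is strongly recommended to register it as a singleton as well; otherwise it is easy to end up instantiating it in a per-request (transient) component in production, and within a few days the Redis server's client slots are exhausted.
② CSRedisCore creates a connection pool and preheats 50 connections by default; developers should keep this in mind.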
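A broader lesson in method: avoid relying on Baidu for answers. Learn to ask good questions and look for answers in official documentation, Stack Overflow, and the GitHub community; whatever pit you have fallen into, someone else has probably already fallen into it and filled it in.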