Advanced Redis - Handling a Redis Cluster Fail State Caused by an Unexpected Power Outage

Pre

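This is a test environment running a pseudo-cluster:

101 : 7001 7002 7003 (three nodes)
102 : 7004 7005 7006 (three nodes)

The server room lost power unexpectedly and the hosts went down.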


Symptoms

Redis Cluster is unavailable and the application cannot start.

The cluster information looks like this:

172.168.15.101:7001> CLUSTER INFO
cluster_state:fail
cluster_slots_assigned:16354
cluster_slots_ok:16354
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:1
cluster_stats_messages_ping_sent:1666
cluster_stats_messages_pong_sent:1063
cluster_stats_messages_sent:2729
cluster_stats_messages_ping_received:1063
cluster_stats_messages_pong_received:1026
cluster_stats_messages_received:2089

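Key points: cluster_state:fail and cluster_slots_assigned:16354. The cluster state is fail, and only 16354 of the 16384 slots are assigned. 30 slots are missing, so the cluster is unavailable.

To protect cluster integrity, by default the whole cluster becomes unavailable as soon as any one of the 16384 slots is not assigned to a node. This safeguard ensures that every slot is assigned to an online node. The behavior corresponds to the cluster-require-full-coverage setting, which defaults to yes.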
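
The arithmetic above can be checked with a small script. This is only a sketch: the function name parse_cluster_info is made up for illustration, and the sample text is an excerpt of the CLUSTER INFO output shown earlier.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def parse_cluster_info(text):
    """Parse the 'key:value' lines of CLUSTER INFO into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip lines that are not key:value pairs
            info[key] = value
    return info


# Excerpt of the output captured on 172.168.15.101:7001
sample = """cluster_state:fail
cluster_slots_assigned:16354
cluster_slots_ok:16354"""

info = parse_cluster_info(sample)
unassigned = TOTAL_SLOTS - int(info["cluster_slots_assigned"])
print(info["cluster_state"], unassigned)
```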

Some slots are clearly unassigned, so reassigning them is the key to fixing the problem.


Finding the unassigned slots

Method 1: cluster slots

172.168.15.101:7001> CLUSTER SLOTS
 1) 1) (integer) 5461
    2) (integer) 5591
    3) 1) "172.168.15.101"
    ....
    ...
    ....
    33) 1) (integer) 0
    2) (integer) 5460
    3) 1) "172.168.15.101"
       2) (integer) 7001
       3) "40b3ab3eb00e0107ea702e96231694016fb5c25f"
    4) 1) "172.168.15.102"
       2) (integer) 7006
       3) "b2392a54bc1ed255d9f86ce5315b3c66177bc54c"
172.168.15.101:7001>


The output is too long, and it is hard to tally slots this way. The second method is recommended.


Method 2: cluster nodes

172.168.15.101:7001> cluster nodes
f434df4b2a8e8262e91b192fdd4329ac7eaba257 172.168.15.101:7003@17003 master - 0 1589854185127 7 connected 5461-5591 5593-5783 5785-5913 5915-6157 6159-6264 6266-6290 6292-6311 6313-6401 6403-6963 6965-7228 7230-7566 7568-7647 7649-7862 7864-8199 8201-8693 8695-8805 8807-8832 8834-9229 9231-9305 9307-9353 9355-9477 9479-9696 9698-9761 9763-9855 9857-10241 10243-10265 10267-10310 10312-10348 10350-10529 10531-10669 10671-10922
8c27d256907bd17ceed4b0bfc8474eb90e7cf71e 172.168.15.102:7004@17004 slave f434df4b2a8e8262e91b192fdd4329ac7eaba257 0 1589854187127 7 connected
8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d 172.168.15.101:7002@17002 master - 0 1589854186127 2 connected 10923-16383
b2392a54bc1ed255d9f86ce5315b3c66177bc54c 172.168.15.102:7006@17006 slave 40b3ab3eb00e0107ea702e96231694016fb5c25f 0 1589854185000 6 connected
40b3ab3eb00e0107ea702e96231694016fb5c25f 172.168.15.101:7001@17001 myself,master - 0 1589854184000 1 connected 0-5460 [5592-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [5784-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [5914-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [6158-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [6265-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [6291-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [6312-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [6402-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [6964-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [7229-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [7567-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [7648-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [7863-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [8200-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [8694-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [8806-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [8833-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9230-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9306-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9354-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9478-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9697-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9762-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [9856-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10242-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10266-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10311-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10349-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10530-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10670-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e] [10973-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [11020-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [11140-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [11144-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [11200-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [11624-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [11802-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] 
[12201-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [12301-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [12681-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [12685-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [13365-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [13676-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [13969-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [13989-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [14395-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [14412-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [15149-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [15611-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [15654-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [15758-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [15778-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [15899-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [16100-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [16105-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d] [16147-<-8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d]
6d8f2f251fa2d881cae91012088e1d5eb653ebb4 172.168.15.102:7005@17005 slave 8dff9fa8b74dd6cdf90a706c3945fbe2025cb57d 0 1589854186000 5 connected


7002 : 10923-16383

7001: 0-5460

7003 : 5461-5591 5593-5783 5785-5913 5915-6157 6159-6264 6266-6290 6292-6311 6313-6401 6403-6963 6965-7228 7230-7566 7568-7647 7649-7862 7864-8199 8201-8693 8695-8805 8807-8832 8834-9229 9231-9305 9307-9353 9355-9477 9479-9696 9698-9761 9763-9855 9857-10241 10243-10265 10267-10310 10312-10348 10350-10529 10531-10669 10671-10922

From this you can work out which slots are missing.

The format of the cluster nodes output will be analyzed later.
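
For orientation, each line of cluster nodes output has the layout: node id, address (ip:port@cluster-bus-port), flags, master id (or - for a master), ping-sent, pong-recv, config epoch, link state, then the slots. A bracketed entry such as [5592-<-8c27...] means this node is importing that slot from the named node, while [slot->-id] would mean it is migrating the slot away. The following is a minimal parsing sketch, not a full parser; the function name parse_node_line and the dict keys are hypothetical, and the sample is a shortened copy of the 7001 line above.

```python
def parse_node_line(line):
    """Split one CLUSTER NODES line into named fields."""
    parts = line.split()
    node = {
        "id": parts[0],
        "addr": parts[1],
        "flags": parts[2].split(","),
        "master_id": parts[3],  # "-" when the node is itself a master
        "ping_sent": int(parts[4]),
        "pong_recv": int(parts[5]),
        "config_epoch": int(parts[6]),
        "link_state": parts[7],
        "slots": [],      # (start, end) ranges owned by this node
        "importing": [],  # (slot, source node id)
        "migrating": [],  # (slot, target node id)
    }
    for token in parts[8:]:
        if token.startswith("["):
            body = token.strip("[]")
            if "-<-" in body:
                slot, _, source = body.partition("-<-")
                node["importing"].append((int(slot), source))
            elif "->-" in body:
                slot, _, target = body.partition("->-")
                node["migrating"].append((int(slot), target))
        elif "-" in token:
            start, end = token.split("-")
            node["slots"].append((int(start), int(end)))
        else:
            # a single slot is printed without a range
            node["slots"].append((int(token), int(token)))
    return node


sample = (
    "40b3ab3eb00e0107ea702e96231694016fb5c25f 172.168.15.101:7001@17001 "
    "myself,master - 0 1589854184000 1 connected 0-5460 "
    "[5592-<-8c27d256907bd17ceed4b0bfc8474eb90e7cf71e]"
)
node = parse_node_line(sample)
print(node["addr"], node["slots"], node["importing"])
```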


Working out the unassigned slots and adding them back

Look at the slot ranges listed after the 7003 master:

5461-5591 5593-5783 5785-5913 5915-6157 6159-6264 6266-6290 6292-6311 6313-6401 6403-6963 6965-7228 7230-7566 7568-7647 7649-7862 7864-8199 8201-8693 8695-8805 8807-8832 8834-9229 9231-9305 9307-9353 9355-9477 9479-9696 9698-9761 9763-9855 9857-10241 10243-10265 10267-10310 10312-10348 10350-10529 10531-10669 10671-10922

Missing: 5592 5784 5914 6158 6265 6291 6312 6402 6964 7229 7567 7648 7863 8200 8694 8806 8833 9230 9306 9354 9478 9697 9762 9856 10242 10266 10311 10349 10530 10670
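
Picking the gaps out by eye is error-prone, so the same result can be computed. The sketch below takes the slot ranges of the three masters exactly as listed above and subtracts them from the full slot space; the helper name covered_slots and the owned dict are assumptions made for this example.

```python
TOTAL_SLOTS = 16384


def covered_slots(ranges_text):
    """Expand a string such as '0-5460 5593-5783 7000' into a set of slot numbers."""
    covered = set()
    for token in ranges_text.split():
        start, _, end = token.partition("-")
        # a token without '-' is a single slot, so fall back to start
        covered.update(range(int(start), int(end or start) + 1))
    return covered


# Slot ranges per master, copied from the cluster nodes output
owned = {
    "7001": "0-5460",
    "7002": "10923-16383",
    "7003": (
        "5461-5591 5593-5783 5785-5913 5915-6157 6159-6264 6266-6290 6292-6311 6313-6401 "
        "6403-6963 6965-7228 7230-7566 7568-7647 7649-7862 7864-8199 8201-8693 8695-8805 "
        "8807-8832 8834-9229 9231-9305 9307-9353 9355-9477 9479-9696 9698-9761 9763-9855 "
        "9857-10241 10243-10265 10267-10310 10312-10348 10350-10529 10531-10669 10671-10922"
    ),
}

assigned = set()
for ranges in owned.values():
    assigned |= covered_slots(ranges)

missing = sorted(set(range(TOTAL_SLOTS)) - assigned)
print(len(missing))
print(" ".join(str(slot) for slot in missing))
```

The count printed should match the 30 slots that CLUSTER INFO reported as unassigned.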

Reassign them:


172.168.15.101:7001> CLUSTER ADDSLOTS 5592 5784 5914 6158 6265  6291 6312 6402 6964 7229 7567 7648 7863 8200 8694 8806 8833 9230 9306 9354 9478 9697 9762 9856 10242 10266 10311 10349 10530 10670 
OK
172.168.15.101:7001>
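
Typing thirty slot numbers by hand invites typos. As a sketch, the command line can be generated from a list of slots and then pasted into redis-cli; the function name build_addslots_command is hypothetical, and the script only builds the text without sending anything to Redis.

```python
TOTAL_SLOTS = 16384


def build_addslots_command(slots):
    """Return a CLUSTER ADDSLOTS command string for the given slot numbers."""
    unique = sorted(set(slots))  # drop duplicates and keep a stable order
    for slot in unique:
        if not 0 <= slot < TOTAL_SLOTS:
            raise ValueError(f"slot out of range: {slot}")
    return "CLUSTER ADDSLOTS " + " ".join(str(slot) for slot in unique)


# First few of the missing slots found above
print(build_addslots_command([5592, 5784, 5914, 6158, 6265]))
```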

After a short while, check again:

172.168.15.101:7001> CLUSTER INFO
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:1
cluster_stats_messages_ping_sent:2108
cluster_stats_messages_pong_sent:1508
cluster_stats_messages_sent:3616
cluster_stats_messages_ping_received:1508
cluster_stats_messages_pong_received:1468
cluster_stats_messages_update_received:19
cluster_stats_messages_received:2995
172.168.15.101:7001>

 

The cluster is OK again.


Redisson initialization failure (Not all slots are covered! Only 10923 slots are avaliable + Failed to add master: redis://172.168.15.101:7002 for slot ranges: [[10923-16383]]. Reason - cluster_state:fail)

Redisson is configured with the cluster addresses.

[2020-05-19 10:44:33,539] INFO [localhost-startStop-1] RedissonManager.<clinit>(27) | redisson client begin to init....
[2020-05-19 10:44:36,365] ERROR [localhost-startStop-1] RedissonManager.<clinit>(52) | org.redisson.client.RedisConnectionException: Not all slots are covered! Only 10923 slots are avaliable
        at org.redisson.cluster.ClusterConnectionManager.<init>(ClusterConnectionManager.java:167)
        at org.redisson.config.ConfigSupport.createConnectionManager(ConfigSupport.java:198)
        at org.redisson.Redisson.<init>(Redisson.java:122)
        at org.redisson.Redisson.create(Redisson.java:159)

       .......
         .......
           .......
Caused by: org.redisson.client.RedisException: Failed to add master: redis://172.168.15.101:7002 for slot ranges: [[10923-16383]]. Reason - cluster_state:fail
        at org.redisson.cluster.ClusterConnectionManager$1$1.operationComplete(ClusterConnectionManager.java:223)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

The cause is clear: redis://172.168.15.101:7002 for slot ranges: [[10923-16383]]. Reason - cluster_state:fail

Connect to port 7002 (be sure to inspect the node information on 7002 itself, not on any other port) and repeat the steps above.
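
The reason for checking 7002 directly is that each node keeps its own view of the cluster, so one node can report ok while another still reports fail. The sketch below takes CLUSTER INFO text collected separately from each node and lists the nodes whose state is not ok. The function name nodes_not_ok is hypothetical and the two sample outputs are illustrative excerpts, not captured logs.

```python
def nodes_not_ok(info_by_node):
    """Return (address, state) for every node whose cluster_state is not 'ok'."""
    failing = []
    for addr, text in info_by_node.items():
        state = None
        for line in text.splitlines():
            key, sep, value = line.strip().partition(":")
            if sep and key == "cluster_state":
                state = value
        if state != "ok":
            failing.append((addr, state))
    return failing


# Illustrative input: CLUSTER INFO run on each node individually
info_by_node = {
    "172.168.15.101:7001": "cluster_state:ok\ncluster_slots_assigned:16384",
    "172.168.15.101:7002": "cluster_state:fail\ncluster_slots_assigned:16354",
}
print(nodes_not_ok(info_by_node))
```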

The nodes were restarted a few times along the way, and the fault was resolved.