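A while ago I ran into a problem in production: the system had to be migrated from one network segment to another. A Redis cluster records each node's ip:port when it is created, so once the node IPs change the cluster stops working. How can it be recovered? In most cases we can simply delete every node's RDB data file (dbfilename), AOF persistence file (appendfilename), and cluster configuration file (cluster-config-file), and then rebuild the cluster. But what if the data has to be kept?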

The following walks through the recovery on a cluster of three masters and three replicas (one replica per master):

[root@test1 bin]# ./redis-cli -a password --cluster check 192.168.66.101:7000
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
192.168.66.101:7000 (d1ddeaa7...) -> 334 keys | 5461 slots | 1 slaves.
192.168.66.102:7003 (d21ce248...) -> 341 keys | 5462 slots | 1 slaves.
192.168.66.101:7001 (bb5c5e76...) -> 325 keys | 5461 slots | 1 slaves.
[OK] 1000 keys in 3 masters.
0.06 keys per slot on average.
>>> Performing Cluster Check (using node 192.168.66.101:7000)
M: d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.66.101:7000
   slots:[0-5460] (5461 slots) master
   1 additional replica(s)
M: d21ce2482179af3b76a9f29d870848bae18a3214 192.168.66.102:7003
   slots:[5461-10922] (5462 slots) master
   1 additional replica(s)
S: 089b2e16dff1f68c399a1efc73580e7cbbbfa71b 192.168.66.101:7002
   slots: (0 slots) slave
   replicates d21ce2482179af3b76a9f29d870848bae18a3214
S: 92d8208b582c6111bd383b6fdfc2d80a86f47350 192.168.66.102:7005
   slots: (0 slots) slave
   replicates d1ddeaa7c77e35b3df50953fc09834b662cbac8b
S: ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.66.102:7004
   slots: (0 slots) slave
   replicates bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65
M: bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 192.168.66.101:7001
   slots:[10923-16383] (5461 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

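Change the IP of every node in the cluster from 192.168.66.* to 192.168.77.*. If we check the cluster state at this point, the cluster still tries to connect to nodes in the 192.168.66.* segment: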

[root@test1 bin]# ./redis-cli -a password --cluster check 192.168.77.101:7000
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at 192.168.66.102:7003: Connection timed out
......

Shut down all nodes.

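Find the cluster-config-file setting in each node's configuration file; it specifies where the cluster configuration file is saved (/data/redis/cluster/7000/nodes_7000.conf in this example). Look at the contents of that file: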

[root@test1 ~]# cat /data/redis/cluster/7000/nodes_7000.conf
d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.66.101:7000@17000 myself,master - 0 1626244031000 1 connected 0-5460
ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.66.102:7004@17004 slave bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 0 1626244034813 5 connected
d21ce2482179af3b76a9f29d870848bae18a3214 192.168.66.102:7003@17003 master - 0 1626244033803 4 connected 5461-10922
089b2e16dff1f68c399a1efc73580e7cbbbfa71b 192.168.66.101:7002@17002 slave d21ce2482179af3b76a9f29d870848bae18a3214 0 1626244032793 4 connected
bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 192.168.66.101:7001@17001 master - 0 1626244030770 2 connected 10923-16383
92d8208b582c6111bd383b6fdfc2d80a86f47350 192.168.66.102:7005@17005 slave d1ddeaa7c77e35b3df50953fc09834b662cbac8b 0 1626244031782 6 connected
vars currentEpoch 6 lastVoteEpoch 0
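
To see at a glance which addresses a node has recorded, the file can be summarized with a small helper. This is only a sketch: the function name list_node_addresses is made up for illustration, and it assumes the layout shown above, where field 1 is the node ID, field 2 is ip:port@cport, and field 3 holds the flags.

```shell
# list_node_addresses FILE
# Print "node-id ip:port role" for every node line in a cluster config file.
# Hypothetical helper; assumes the field layout seen in the listing above.
list_node_addresses() {
    awk '$1 != "vars" && NF >= 8 {
        addr = $2
        sub(/@.*/, "", addr)      # drop the "@cport" cluster bus suffix
        role = "slave"
        if ($3 ~ /master/) role = "master"
        print $1, addr, role
    }' "$1"
}

# Example (path taken from this article):
#   list_node_addresses /data/redis/cluster/7000/nodes_7000.conf
```

Running it on every node's file before and after the edit gives a quick way to confirm that no old address is left behind.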

In the cluster-config-file of every node, change the IP addresses from 192.168.66.* to 192.168.77.*:

# Run on the first host (192.168.66.101 before the change)
sed -i 's/192\.168\.66\./192.168.77./g' /data/redis/cluster/7000/nodes_7000.conf /data/redis/cluster/7001/nodes_7001.conf /data/redis/cluster/7002/nodes_7002.conf

# Run on the second host (192.168.66.102 before the change)
sed -i 's/192\.168\.66\./192.168.77./g' /data/redis/cluster/7003/nodes_7003.conf /data/redis/cluster/7004/nodes_7004.conf /data/redis/cluster/7005/nodes_7005.conf
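
Because these files are edited in place, it may be worth keeping a copy of each one first. The sketch below wraps the same substitution in a function that writes a .bak file before editing and escapes the dots in the old prefix. The name rewrite_node_ips and the .bak suffix are assumptions made for illustration, and sed -i is assumed to be GNU sed, as in the commands above.

```shell
# rewrite_node_ips OLD_PREFIX NEW_PREFIX FILE...
# Replace "OLD_PREFIX." with "NEW_PREFIX." in each file, after saving FILE.bak.
# Hypothetical helper; prefixes are the first three octets, e.g. 192.168.66.
rewrite_node_ips() {
    local old="$1" new="$2"
    shift 2
    # Escape dots so "." matches a literal dot instead of any character.
    local old_re
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    local f
    for f in "$@"; do
        cp -p "$f" "$f.bak" || return 1
        sed -i "s/${old_re}\\./${new}./g" "$f" || return 1
    done
}

# Example (paths taken from this article):
#   rewrite_node_ips 192.168.66 192.168.77 /data/redis/cluster/7000/nodes_7000.conf
```

If something goes wrong, the .bak copies can be moved back while the nodes are still stopped.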

Start all nodes.

Check the cluster state again:

[root@test1 bin]# ./redis-cli -a password --cluster check 192.168.77.101:7000
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
192.168.77.101:7000 (d1ddeaa7...) -> 334 keys | 5461 slots | 1 slaves.
192.168.77.102:7003 (d21ce248...) -> 341 keys | 5462 slots | 1 slaves.
192.168.77.101:7001 (bb5c5e76...) -> 325 keys | 5461 slots | 1 slaves.
[OK] 1000 keys in 3 masters.
0.06 keys per slot on average.
>>> Performing Cluster Check (using node 192.168.77.101:7000)
M: d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.77.101:7000
   slots:[0-5460] (5461 slots) master
   1 additional replica(s)
S: 92d8208b582c6111bd383b6fdfc2d80a86f47350 192.168.77.102:7005
   slots: (0 slots) slave
   replicates d1ddeaa7c77e35b3df50953fc09834b662cbac8b
M: d21ce2482179af3b76a9f29d870848bae18a3214 192.168.77.102:7003
   slots:[5461-10922] (5462 slots) master
   1 additional replica(s)
M: bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 192.168.77.101:7001
   slots:[10923-16383] (5461 slots) master
   1 additional replica(s)
S: 089b2e16dff1f68c399a1efc73580e7cbbbfa71b 192.168.77.101:7002
   slots: (0 slots) slave
   replicates d21ce2482179af3b76a9f29d870848bae18a3214
S: ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.77.102:7004
   slots: (0 slots) slave
   replicates bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

[root@test1 bin]# ./redis-cli -a password --cluster info 192.168.77.101:7000 
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
192.168.77.101:7000 (d1ddeaa7...) -> 334 keys | 5461 slots | 1 slaves.
192.168.77.102:7003 (d21ce248...) -> 341 keys | 5462 slots | 1 slaves.
192.168.77.101:7001 (bb5c5e76...) -> 325 keys | 5461 slots | 1 slaves.
[OK] 1000 keys in 3 masters.
0.06 keys per slot on average.

The cluster state has recovered, and the key count matches the count before the IP change.
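
If this comparison needs to be scripted, the per-master key counts can be summed from the --cluster info output. The function below is a rough sketch: total_keys is a made-up name, and it assumes the "-> N keys" line format shown in the output above.

```shell
# total_keys
# Read `redis-cli --cluster info` output on stdin and print the total
# number of keys across all masters. Hypothetical helper; relies on the
# "-> N keys" pattern seen in the output above.
total_keys() {
    awk '/-> [0-9]+ keys/ {
        for (i = 1; i < NF; i++)
            if ($i == "->") { sum += $(i + 1); break }
    } END { print sum + 0 }'
}

# Example (command taken from this article):
#   ./redis-cli -a password --cluster info 192.168.77.101:7000 | total_keys
```

Saving the number before the migration and comparing it afterwards gives a simple sanity check, although it does not prove that every individual key survived.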

Test writing to and reading from the cluster:

[root@test1 bin]# ./redis-cli -a password -c -h 192.168.77.101 -p 7000       
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
192.168.77.101:7000> keys *
  1) "name725"
  2) "name359"

......
192.168.77.101:7000> get name7
"hello\n"
192.168.77.101:7000> get name400
-> Redirected to slot [11448] located at 192.168.77.101:7001
"hello\n"
192.168.77.101:7001> set testkey 'testvalue'
-> Redirected to slot [4757] located at 192.168.77.101:7000
OK
192.168.77.101:7000> get testkey
"testvalue"

Existing data reads correctly and new data can be written and read back, so the cluster has recovered.

Summary

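After the IPs of Redis cluster nodes change, it is enough to stop all nodes, update the IP addresses in every node's cluster-config-file to the new addresses, and start the nodes again; the cluster then recovers on its own.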