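The post on node interaction covered how the nodes of a Redis Cluster talk to each other and detect failures. This post summarizes how the cluster performs failover.
Failover
The failover logic is also triggered periodically from clusterCron(). The whole flow lives in clusterHandleSlaveFailover(void).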
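1. Basic concepts
To make the source easier to follow, here is what the relevant variables mean.
server.cluster->failover_auth_time: the time at which this slave starts (or is scheduled to start) its failover election;
auth_age: the time elapsed between failover_auth_time and now;
needed_quorum: the number of votes required for the failover to succeed, (server.cluster->size / 2) + 1;
server.cluster->size: the number of master nodes that serve at least one slot;
auth_timeout: how long the slave waits for replies after asking for votes, at least 2 s (2000 ms). If it has not collected enough votes within this time, the current failover attempt has failed;
auth_retry_time: the interval before the next failover attempt may be started, twice auth_timeout;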
mstime_t data_age;
mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
int needed_quorum = (server.cluster->size / 2) + 1;
int manual_failover = server.cluster->mf_end != 0 &&
                      server.cluster->mf_can_start;
mstime_t auth_timeout, auth_retry_time;

server.cluster->todo_before_sleep &= ~CLUSTER_TODO_HANDLE_FAILOVER;

/* Compute the failover timeout (the max time we have to send votes
 * and wait for replies), and the failover retry time (the time to wait
 * before trying to get voted again).
 *
 * Timeout is MAX(NODE_TIMEOUT*2,2000) milliseconds.
 * Retry is two times the Timeout. */
auth_timeout = server.cluster_node_timeout*2;
if (auth_timeout < 2000) auth_timeout = 2000;
auth_retry_time = auth_timeout*2;
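To make the arithmetic concrete, the following is a minimal sketch of the three derived values, assuming the formulas quoted in the snippet. The failover_params struct and compute_failover_params() are hypothetical names introduced only for illustration; they are not part of Redis.

```c
typedef long long mstime_t; /* milliseconds, as in Redis */

/* Hypothetical container for the derived values (not a Redis type). */
typedef struct {
    mstime_t auth_timeout;    /* max time to wait for votes */
    mstime_t auth_retry_time; /* wait before the next attempt */
    int needed_quorum;        /* votes required to win */
} failover_params;

/* Mirrors the formulas above:
 * auth_timeout = MAX(node_timeout * 2, 2000),
 * auth_retry_time = auth_timeout * 2,
 * needed_quorum = num_masters / 2 + 1 (integer division). */
failover_params compute_failover_params(mstime_t node_timeout_ms,
                                        int num_masters) {
    failover_params p;
    p.auth_timeout = node_timeout_ms * 2;
    if (p.auth_timeout < 2000) p.auth_timeout = 2000;
    p.auth_retry_time = p.auth_timeout * 2;
    p.needed_quorum = num_masters / 2 + 1;
    return p;
}
```

With a node timeout of 15000 ms and 5 masters this gives auth_timeout = 30000 ms, auth_retry_time = 60000 ms and needed_quorum = 3. With a very small node timeout such as 500 ms, the 2000 ms floor applies.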
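2. Checking whether the node may start a failover
A node may start a failover only if all of the following hold:
a. it is a slave;
b. its master is not NULL;
c. its master serves at least one slot;
d. its master is flagged as FAIL, or this is a manual failover (manual_failover is true);
e. the no-failover option (cluster_slave_no_failover) is not set, unless this is a manual failover.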
/* Pre conditions to run the function, that must be met both in case
 * of an automatic or manual failover:
 * 1) We are a slave.
 * 2) Our master is flagged as FAIL, or this is a manual failover.
 * 3) We don't have the no failover configuration set, and this is
 *    not a manual failover.
 * 4) It is serving slots. */
if (nodeIsMaster(myself) ||
    myself->slaveof == NULL ||
    (!nodeFailed(myself->slaveof) && !manual_failover) ||
    (server.cluster_slave_no_failover && !manual_failover) ||
    myself->slaveof->numslots == 0)
{
    /* There are no reasons to failover, so we set the reason why we
     * are returning without failing over to NONE. */
    server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
    return;
}
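The precondition check can be read as a single predicate. The sketch below restates it over a simplified node structure so that each condition can be exercised in isolation. node_view and can_consider_failover() are hypothetical names for illustration, assuming only the fields used by the snippet above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a cluster node (not the Redis type). */
typedef struct node_view {
    bool is_master;                  /* node currently acts as master */
    bool failed;                     /* node is flagged as FAIL */
    int numslots;                    /* number of hash slots served */
    const struct node_view *slaveof; /* master of this node, or NULL */
} node_view;

/* Returns true when the preconditions listed above are all met. */
bool can_consider_failover(const node_view *self, bool manual_failover,
                           bool no_failover_configured) {
    if (self->is_master) return false;        /* must be a slave */
    if (self->slaveof == NULL) return false;  /* must have a master */
    if (!self->slaveof->failed && !manual_failover) return false;
    if (no_failover_configured && !manual_failover) return false;
    if (self->slaveof->numslots == 0) return false; /* master serves slots */
    return true;
}
```

A manual failover bypasses both the FAIL flag and the no-failover option, but never the requirement that the master actually serves slots.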
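3. Checking whether the node's data is too old
data_age: how long the slave has gone without interacting with its master. If the slave's data is too old, the slave is not allowed to replace the failed master automatically and the failure has to be handled by an operator (a manual failover bypasses this check).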
/* Set data_age to the number of seconds we are disconnected from the master. */
if (server.repl_state == REPL_STATE_CONNECTED) {
    data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
               * 1000;
} else {
    data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
}

/* Remove the node timeout from the data age as it is fine that we are
 * disconnected from our master at least for the time it was down to be
 * flagged as FAIL, that's the baseline. */
if (data_age > server.cluster_node_timeout)
    data_age -= server.cluster_node_timeout;

/* Check if our data is recent enough according to the slave validity
 * factor configured by the user.
 *
 * Check bypassed for manual failovers. */
if (server.cluster_slave_validity_factor &&
    data_age >
    (((mstime_t)server.repl_ping_slave_period * 1000) +
     (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
{
    if (!manual_failover) {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
        return;
    }
}
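A worked example helps with the threshold. The sketch below reproduces the comparison from the snippet above as a pure function, assuming the same formula. data_too_old() and its parameter names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

typedef long long mstime_t; /* milliseconds */

/* raw_age_ms: time since the last interaction with the master.
 * Returns true when the data is considered too old for an automatic
 * failover. A validity factor of 0 disables the check. */
bool data_too_old(mstime_t raw_age_ms, mstime_t node_timeout_ms,
                  int ping_period_s, int validity_factor) {
    mstime_t data_age = raw_age_ms;

    /* The time needed to flag the master as FAIL is not held against us. */
    if (data_age > node_timeout_ms) data_age -= node_timeout_ms;

    if (validity_factor == 0) return false;

    mstime_t limit = (mstime_t)ping_period_s * 1000 +
                     node_timeout_ms * validity_factor;
    return data_age > limit;
}
```

For example, with a node timeout of 15000 ms, a ping period of 10 s and a validity factor of 10, the limit is 10000 + 150000 = 160000 ms. A slave that last heard from its master 170000 ms ago has an adjusted age of 155000 ms and still qualifies, while one at 180000 ms (adjusted 165000 ms) does not.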
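4. Scheduling the election
Once the previous attempt has timed out and the retry interval has passed (auth_age > auth_retry_time), the slave schedules a new election and broadcasts a PONG carrying its replication offset to the other slaves of the same master.
failover_auth_rank: as clusterGetSlaveRank() shows, the rank is determined by the replication offset. The more data a slave has replicated, the better (lower) its rank and the earlier it starts its election;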
/* If the previous failover attempt timedout and the retry time has
 * elapsed, we can setup a new one. */
if (auth_age > auth_retry_time) {
    server.cluster->failover_auth_time = mstime() +
        500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
        random() % 500; /* Random delay between 0 and 500 milliseconds. */
    server.cluster->failover_auth_count = 0;
    server.cluster->failover_auth_sent = 0;
    server.cluster->failover_auth_rank = clusterGetSlaveRank();
    /* We add another delay that is proportional to the slave rank.
     * Specifically 1 second * rank. This way slaves that have a probably
     * less updated replication offset, are penalized. */
    server.cluster->failover_auth_time +=
        server.cluster->failover_auth_rank * 1000;
    /* However if this is a manual failover, no delay is needed. */
    if (server.cluster->mf_end) {
        server.cluster->failover_auth_time = mstime();
        server.cluster->failover_auth_rank = 0;
    }
    serverLog(LL_WARNING,
        "Start of election delayed for %lld milliseconds "
        "(rank #%d, offset %lld).",
        server.cluster->failover_auth_time - mstime(),
        server.cluster->failover_auth_rank,
        replicationGetSlaveOffset());
    /* Now that we have a scheduled election, broadcast our offset
     * to all the other slaves so that they'll updated their offsets
     * if our offset is better. */
    clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
    return;
}
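The delay before the election is the sum of a fixed part, a random part and a rank penalty. The sketch below isolates that computation, assuming the constants from the snippet above. election_delay_ms() is a hypothetical helper, and the random component is passed in as jitter_ms instead of calling random() so that the result is reproducible.

```c
typedef long long mstime_t; /* milliseconds */

/* Delay between "now" and the scheduled start of the election.
 * jitter_ms stands in for random(); it is reduced modulo 500 just like
 * in the snippet above. A manual failover starts without any delay. */
mstime_t election_delay_ms(int rank, int jitter_ms, int manual_failover) {
    if (manual_failover) return 0;
    return 500                      /* fixed delay, lets FAIL propagate */
         + (jitter_ms % 500)        /* random delay in [0, 499] */
         + (mstime_t)rank * 1000;   /* 1 second penalty per rank */
}
```

A rank 0 slave therefore waits between 500 and 999 ms, while a rank 2 slave waits between 2500 and 2999 ms, so the slave with the most up-to-date offset usually asks for votes first.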
5. Requesting votes from the cluster
The slave broadcasts a vote request to the cluster:
type=CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST
/* Ask for votes if needed. */
if (server.cluster->failover_auth_sent == 0) {
    server.cluster->currentEpoch++;
    server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
    ...
    clusterRequestFailoverAuth();
    ...
    server.cluster->failover_auth_sent = 1;
    return; /* Wait for replies. */
}
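6. Failover: promoting the slave to master
Once the slave has collected needed_quorum votes, it switches to master.
The main flow is in clusterFailoverReplaceYourMaster(void):
1. Update the node's own state to that of a master, for example its master/slave information;
2. Take over the hash slots of the old master;
3. Broadcast the change to all other nodes.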
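Whether the promotion happens at all depends on the vote count and on the election timeout described in section 1. The following is a rough sketch of that decision, assuming the quorum formula and timeout semantics given there. The election_state enum and check_election() are hypothetical names, and the actual Redis code spreads this logic across clusterHandleSlaveFailover().

```c
typedef long long mstime_t; /* milliseconds */

/* Hypothetical outcome of one look at a running election. */
typedef enum {
    ELECTION_WON,       /* enough votes: promote to master */
    ELECTION_WAITING,   /* still within the timeout, keep waiting */
    ELECTION_TIMED_OUT  /* attempt failed, retry after auth_retry_time */
} election_state;

election_state check_election(int votes, int num_masters,
                              mstime_t auth_age_ms,
                              mstime_t auth_timeout_ms) {
    int needed_quorum = num_masters / 2 + 1; /* strict majority */

    if (auth_age_ms > auth_timeout_ms) return ELECTION_TIMED_OUT;
    if (votes >= needed_quorum) return ELECTION_WON;
    return ELECTION_WAITING;
}
```

With 5 masters, 3 votes received inside the timeout win the election; 2 votes leave the slave waiting, and any vote count after the timeout counts as a failed attempt.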
7. Voting
When a node receives a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST packet it decides whether to vote; see clusterSendFailoverAuthIfNeeded() for the details.
a. The node must be a master and must serve at least one hash slot:
if (nodeIsSlave(myself) || myself->numslots == 0) return;
b. A node votes at most once per epoch, so each failover round gets at most one vote from it:
/* I already voted for this epoch? Return ASAP. */
if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) {
    serverLog(LL_WARNING,
        "Failover auth denied to %.40s: already voted for epoch %llu",
        node->name,
        (unsigned long long) server.cluster->currentEpoch);
    return;
}
c. The node builds the reply packet and sends it back to the sender:
type=CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK
clusterSendFailoverAuth(node);
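The two checks shown above can be combined into a small state machine on the voter side. The sketch below covers only those two checks and records the vote by updating the last voted epoch. voter_state and try_grant_vote() are hypothetical names, and the real clusterSendFailoverAuthIfNeeded() performs further checks that are not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified voter state (not the Redis structures). */
typedef struct {
    bool is_slave;            /* slaves never vote */
    int numslots;             /* masters without slots never vote */
    uint64_t current_epoch;   /* cluster currentEpoch as seen locally */
    uint64_t last_vote_epoch; /* epoch of the last vote granted */
} voter_state;

/* Returns true if a vote is granted; the caller would then send the
 * CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK reply. */
bool try_grant_vote(voter_state *v) {
    if (v->is_slave || v->numslots == 0) return false;
    if (v->last_vote_epoch == v->current_epoch) return false; /* voted */
    v->last_vote_epoch = v->current_epoch; /* remember the vote */
    return true;
}
```

Calling try_grant_vote() twice in the same epoch grants only the first request; a second slave asking in that epoch is denied until the epoch advances.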
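8. Delayed handling
Several operations in the flow above are deliberately delayed for a while, for example the fixed 500 ms delay before the election. The purpose is to give the information, such as the FAIL message, enough time to propagate across the nodes.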