1. Cluster Election Handling
The process by which a slave node is promoted to master through an election was already mentioned in the earlier article RedisCluster 集群实现原理 (on the principles of the Redis Cluster implementation); roughly it works as follows:
In each run of the cluster cron job, a Slave node checks whether its Master is flagged FAIL; if so, it attempts a failover in the hope of becoming the new Master. Before that, however, the Slave must qualify as a valid candidate: if the time it has been disconnected from its Master exceeds (repl-ping-slave-period * 1000) + cluster-node-timeout * cluster-slave-validity-factor, the Slave is not eligible to be promoted to Master
Since the failed Master may have several Slaves, before launching an election each Slave derives an election priority (rank) from the replication offsets of all the Slaves: the smaller a Slave's rank, the earlier it may start its election
When a Slave's election time arrives, it increments its currentEpoch variable (which can be viewed as a record of the version, or era, the cluster is in), then broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to every node in the cluster, asking them to vote
Every cluster node receives the broadcast, but only Master nodes respond. A Master first checks that the currentEpoch carried in the requesting Slave's message is no smaller than its own currentEpoch, then applies several further checks before finally deciding whether to vote for the Slave. If it decides to vote, it replies with a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message
After sending the broadcast, the Slave periodically checks the replies coming back from other nodes and updates its vote counter server.cluster->failover_auth_count
Once the number of votes the Slave has gathered exceeds half of the total number of Masters in the cluster, it wins the election: it switches itself to a Master and broadcasts a message announcing that the failover is complete. A worked sketch of the arithmetic involved in these steps follows this list
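To make the numbers above concrete, here is a minimal standalone sketch (not Redis source; all configuration values are assumptions) that reproduces the arithmetic the election relies on: the quorum, the election timeout and retry interval, the candidate-validity bound on disconnection time, and the rank-based start delay.
/* election_numbers.c - a standalone sketch, assuming cluster-node-timeout
 * 15000 ms, repl-ping-slave-period 10 s, cluster-slave-validity-factor 10,
 * a cluster of 3 masters, and this slave at rank 1. */
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    long long node_timeout = 15000;  /* assumed cluster-node-timeout (ms) */
    long long ping_period = 10;      /* assumed repl-ping-slave-period (s) */
    long long validity_factor = 10;  /* assumed cluster-slave-validity-factor */
    int masters = 3;                 /* assumed number of masters */
    int rank = 1;                    /* assumed rank of this slave */
    int needed_quorum = masters / 2 + 1;          /* 3/2+1 = 2 votes to win */
    long long auth_timeout = node_timeout * 2;    /* 30000 ms */
    if (auth_timeout < 2000) auth_timeout = 2000;
    long long auth_retry_time = auth_timeout * 2; /* 60000 ms */
    /* Maximum disconnection time for a slave to remain a valid candidate. */
    long long max_data_age = ping_period * 1000 + node_timeout * validity_factor;
    /* Election start delay: fixed 500 ms + random 0..499 ms + 1000 ms per rank. */
    long long delay = 500 + random() % 500 + rank * 1000LL;
    printf("quorum=%d timeout=%lldms retry=%lldms max_data_age=%lldms delay=%lldms\n",
           needed_quorum, auth_timeout, auth_retry_time, max_data_age, delay);
    return 0;
}
With these assumed values, a slave disconnected from its master for more than 160 seconds is disqualified, and a rank-1 slave starts its election roughly 1.5 to 2 seconds after it observes the FAIL flag.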
2. Election Source Code Analysis
2.1 The Slave Node Starts an Election
The cluster cron job cluster.c#clusterCron() is fairly long, so the parts unrelated to failover are omitted here. As the code shows, as long as the current node is a Slave and its configuration allows failover, clusterHandleSlaveFailover() is called to attempt a failover
void clusterCron(void) {
dictIterator *di;
dictEntry *de;
int update_state = 0;
int orphaned_masters; /* How many masters there are without ok slaves. */
int max_slaves; /* Max number of ok slaves for a single master. */
int this_slaves; /* Number of ok slaves for our master (if we are slave). */
mstime_t min_pong = 0, now = mstime();
clusterNode *min_pong_node = NULL;
static unsigned long long iteration = 0;
mstime_t handshake_timeout;
iteration++; /* Number of times this function was called so far. */
......
if (nodeIsSlave(myself)) {
clusterHandleManualFailover();
if (!(server.cluster_module_flags & CLUSTER_MODULE_FLAG_NO_FAILOVER))
clusterHandleSlaveFailover();
/* If there are orphaned slaves, and we are a slave among the masters
* with the max number of non-failing slaves, consider migrating to
* the orphaned masters. Note that it does not make sense to try
* a migration if there is no master with at least *two* working
* slaves. */
if (orphaned_masters && max_slaves >= 2 && this_slaves == max_slaves)
clusterHandleSlaveMigration(max_slaves);
}
if (update_state || server.cluster->state == CLUSTER_FAIL)
clusterUpdateState();
}
The implementation of cluster.c#clusterHandleSlaveFailover() is fairly long, but the flow is clear:
First, various variables are initialized, such as needed_quorum (the number of votes required to win), auth_timeout (the election timeout), and auth_retry_time (the interval to wait before retrying an election)
Next, the preconditions for starting an election are checked, including:
1. This node is a Slave
2. This node's Master is flagged FAIL, i.e. it is objectively offline (or this is a manual failover)
3. The configuration option forbidding failover (cluster_slave_no_failover) is not set
4. This node's Master is responsible for at least one slot
If all the checks above pass, a failover may proceed. At this point, if auth_retry_time has elapsed since the previous attempt, clusterGetSlaveRank() is called again to recompute this Slave's rank and a fresh election time is scheduled
If this is not the retry-after-timeout case, clusterGetSlaveRank() is likewise used to determine the Slave's rank and election time; once the election time arrives, currentEpoch is incremented and clusterRequestFailoverAuth() is called to start the election
void clusterHandleSlaveFailover(void) {
mstime_t data_age;
mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
int needed_quorum = (server.cluster->size / 2) + 1;
int manual_failover = server.cluster->mf_end != 0 &&
server.cluster->mf_can_start;
mstime_t auth_timeout, auth_retry_time;
server.cluster->todo_before_sleep &= ~CLUSTER_TODO_HANDLE_FAILOVER;
/* Compute the failover timeout (the max time we have to send votes
* and wait for replies), and the failover retry time (the time to wait
* before trying to get voted again).
*
* Timeout is MAX(NODE_TIMEOUT*2,2000) milliseconds.
* Retry is two times the Timeout.
*/
auth_timeout = server.cluster_node_timeout*2;
if (auth_timeout < 2000) auth_timeout = 2000;
auth_retry_time = auth_timeout*2;
/* Pre conditions to run the function, that must be met both in case
* of an automatic or manual failover:
* 1) We are a slave.
* 2) Our master is flagged as FAIL, or this is a manual failover.
* 3) We don't have the no failover configuration set, and this is
* not a manual failover.
* 4) It is serving slots. */
if (nodeIsMaster(myself) ||
myself->slaveof == NULL ||
(!nodeFailed(myself->slaveof) && !manual_failover) ||
(server.cluster_slave_no_failover && !manual_failover) ||
myself->slaveof->numslots == 0)
{
/* There are no reasons to failover, so we set the reason why we
* are returning without failing over to NONE. */
server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
return;
}
/* Set data_age to the number of seconds we are disconnected from
* the master. */
if (server.repl_state == REPL_STATE_CONNECTED) {
data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
* 1000;
} else {
data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
}
/* Remove the node timeout from the data age as it is fine that we are
* disconnected from our master at least for the time it was down to be
* flagged as FAIL, that's the baseline. */
if (data_age > server.cluster_node_timeout)
data_age -= server.cluster_node_timeout;
/* Check if our data is recent enough according to the slave validity
* factor configured by the user.
*
* Check bypassed for manual failovers. */
if (server.cluster_slave_validity_factor &&
data_age >
(((mstime_t)server.repl_ping_slave_period * 1000) +
(server.cluster_node_timeout * server.cluster_slave_validity_factor)))
{
if (!manual_failover) {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
return;
}
}
/* If the previous failover attempt timedout and the retry time has
* elapsed, we can setup a new one. */
if (auth_age > auth_retry_time) {
server.cluster->failover_auth_time = mstime() +
500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
random() % 500; /* Random delay between 0 and 500 milliseconds. */
server.cluster->failover_auth_count = 0;
server.cluster->failover_auth_sent = 0;
server.cluster->failover_auth_rank = clusterGetSlaveRank();
/* We add another delay that is proportional to the slave rank.
* Specifically 1 second * rank. This way slaves that have a probably
* less updated replication offset, are penalized. */
server.cluster->failover_auth_time +=
server.cluster->failover_auth_rank * 1000;
/* However if this is a manual failover, no delay is needed. */
if (server.cluster->mf_end) {
server.cluster->failover_auth_time = mstime();
server.cluster->failover_auth_rank = 0;
clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
}
serverLog(LL_WARNING,
"Start of election delayed for %lld milliseconds "
"(rank #%d, offset %lld).",
server.cluster->failover_auth_time - mstime(),
server.cluster->failover_auth_rank,
replicationGetSlaveOffset());
/* Now that we have a scheduled election, broadcast our offset
* to all the other slaves so that they'll updated their offsets
* if our offset is better. */
clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
return;
}
/* It is possible that we received more updated offsets from other
* slaves for the same master since we computed our election delay.
* Update the delay if our rank changed.
*
* Not performed if this is a manual failover. */
if (server.cluster->failover_auth_sent == 0 &&
server.cluster->mf_end == 0)
{
int newrank = clusterGetSlaveRank();
if (newrank > server.cluster->failover_auth_rank) {
long long added_delay =
(newrank - server.cluster->failover_auth_rank) * 1000;
server.cluster->failover_auth_time += added_delay;
server.cluster->failover_auth_rank = newrank;
serverLog(LL_WARNING,
"Replica rank updated to #%d, added %lld milliseconds of delay.",
newrank, added_delay);
}
}
/* Return ASAP if we can't still start the election. */
if (mstime() < server.cluster->failover_auth_time) {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_DELAY);
return;
}
/* Return ASAP if the election is too old to be valid. */
if (auth_age > auth_timeout) {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_EXPIRED);
return;
}
/* Ask for votes if needed. */
if (server.cluster->failover_auth_sent == 0) {
server.cluster->currentEpoch++;
server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
(unsigned long long) server.cluster->currentEpoch);
clusterRequestFailoverAuth();
server.cluster->failover_auth_sent = 1;
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
CLUSTER_TODO_UPDATE_STATE|
CLUSTER_TODO_FSYNC_CONFIG);
return; /* Wait for replies. */
}
/* Check if we reached the quorum. */
if (server.cluster->failover_auth_count >= needed_quorum) {
/* We have the quorum, we can finally failover the master. */
serverLog(LL_WARNING,
"Failover election won: I'm the new master.");
/* Update my configEpoch to the epoch of the election. */
if (myself->configEpoch < server.cluster->failover_auth_epoch) {
myself->configEpoch = server.cluster->failover_auth_epoch;
serverLog(LL_WARNING,
"configEpoch set to %llu after successful failover",
(unsigned long long) myself->configEpoch);
}
/* Take responsibility for the cluster slots. */
clusterFailoverReplaceYourMaster();
} else {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
}
}
The way cluster.c#clusterGetSlaveRank() computes a Slave's rank is simple: it iterates over all the Slaves attached to the same Master as the current node, skips any that cannot fail over, and increments this node's rank for every sibling whose replication offset is greater than its own. A higher rank therefore pushes this node's election start time further back (a small worked example follows the listing)
int clusterGetSlaveRank(void) {
long long myoffset;
int j, rank = 0;
clusterNode *master;
serverAssert(nodeIsSlave(myself));
master = myself->slaveof;
if (master == NULL) return 0; /* Never called by slaves without master. */
myoffset = replicationGetSlaveOffset();
for (j = 0; j < master->numslaves; j++)
if (master->slaves[j] != myself &&
!nodeCantFailover(master->slaves[j]) &&
master->slaves[j]->repl_offset > myoffset) rank++;
return rank;
}
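As a quick illustration of the rule (a hypothetical helper, not part of Redis): with sibling offsets {600, 800} and our own offset 700, exactly one sibling is ahead, so the rank is 1 and the election start is pushed back by an extra 1000 ms.
#include <stddef.h>
/* Hypothetical helper mirroring the rank rule above: count how many sibling
 * replicas have a replication offset ahead of ours. */
int rank_among(long long myoffset, const long long *siblings, size_t n) {
    int rank = 0;
    for (size_t j = 0; j < n; j++)
        if (siblings[j] > myoffset) rank++;
    return rank; /* rank_among(700, (long long[]){600, 800}, 2) == 1 */
}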
The logic of cluster.c#clusterRequestFailoverAuth() is not complicated either; it consists of two main steps:
Call clusterBuildMessageHdr() to build a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message
Call clusterBroadcastMessage() to send the vote request to every node in the cluster. Note that totlen is set to sizeof(clusterMsg)-sizeof(union clusterMsgData), so only the message header is sent; this message type carries no body
void clusterRequestFailoverAuth(void) {
clusterMsg buf[1];
clusterMsg *hdr = (clusterMsg*) buf;
uint32_t totlen;
clusterBuildMessageHdr(hdr,CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST);
/* If this is a manual failover, set the CLUSTERMSG_FLAG0_FORCEACK bit
* in the header to communicate the nodes receiving the message that
* they should authorized the failover even if the master is working. */
if (server.cluster->mf_end) hdr->mflags[0] |= CLUSTERMSG_FLAG0_FORCEACK;
totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
hdr->totlen = htonl(totlen);
clusterBroadcastMessage(buf,totlen);
}
void clusterBroadcastMessage(void *buf, size_t len) {
dictIterator *di;
dictEntry *de;
di = dictGetSafeIterator(server.cluster->nodes);
while((de = dictNext(di)) != NULL) {
clusterNode *node = dictGetVal(de);
if (!node->link) continue;
if (node->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
continue;
clusterSendMessage(node->link,buf,len);
}
dictReleaseIterator(di);
}
2.2 Master Node Voting
In the earlier article Redis 6.0 源码阅读笔记(12)-Redis 集群建立流程 (on the cluster setup flow) we already mentioned that the function redis uses to handle cluster messages is cluster.c#clusterProcessPacket(). Here too the parts irrelevant to this section are omitted; the handling is as follows:
The type field in the message header determines the message type, which drives the corresponding parsing and handling
For a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message, clusterSendFailoverAuthIfNeeded() is called to handle the vote request
int clusterProcessPacket(clusterLink *link) {
clusterMsg *hdr = (clusterMsg*) link->rcvbuf;
uint32_t totlen = ntohl(hdr->totlen);
uint16_t type = ntohs(hdr->type);
mstime_t now = mstime();
......
else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST ||
type == CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK ||
type == CLUSTERMSG_TYPE_MFSTART)
{
uint32_t explen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
if (totlen != explen) return 1;
......
} else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST) {
if (!sender) return 1; /* We don't know that node. */
clusterSendFailoverAuthIfNeeded(sender,hdr);
} else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK) {
if (!sender) return 1; /* We don't know that node. */
/* We consider this vote only if the sender is a master serving
* a non zero number of slots, and its currentEpoch is greater or
* equal to epoch where this node started the election. */
if (nodeIsMaster(sender) && sender->numslots > 0 &&
senderCurrentEpoch >= server.cluster->failover_auth_epoch)
{
server.cluster->failover_auth_count++;
/* Maybe we reached a quorum here, set a flag to make sure
* we check ASAP. */
clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
}
}
......
}
cluster.c#clusterSendFailoverAuthIfNeeded() runs a series of checks on the vote request before deciding whether to vote. Its processing is as follows:
If this node is a Slave, or is a Master that serves no slot, it has no right to vote, and the function returns immediately
If the requestCurrentEpoch carried by the request is smaller than this node's currentEpoch, the vote request is stale and is ignored
If lastVoteEpoch, which records the epoch in which this node last voted, equals currentEpoch, this node has already voted in this epoch, and the request is likewise ignored
Check that the node requesting the vote is a Slave whose Master is in the FAIL state; a non-failing Master is tolerated only if the request carries the CLUSTERMSG_FLAG0_FORCEACK flag (manual failover)
In addition, the Master refuses to vote again for any Slave of the same Master within 2 * cluster-node-timeout of its previous vote (the voted_time check)
Scan every slot claimed in the request's slot bitmap: if some claimed slot is currently served by a Master whose configEpoch is greater than the requestConfigEpoch carried by the requester, the requester's configuration is stale, and no vote is granted (a sketch of the bitmapTestBit() helper used in this scan follows the listing)
Once all the checks above pass, the Master decides to vote for the requesting Slave: it records lastVoteEpoch and voted_time, and calls clusterSendFailoverAuth() to send the reply
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
clusterNode *master = node->slaveof;
uint64_t requestCurrentEpoch = ntohu64(request->currentEpoch);
uint64_t requestConfigEpoch = ntohu64(request->configEpoch);
unsigned char *claimed_slots = request->myslots;
int force_ack = request->mflags[0] & CLUSTERMSG_FLAG0_FORCEACK;
int j;
/* IF we are not a master serving at least 1 slot, we don't have the
* right to vote, as the cluster size in Redis Cluster is the number
* of masters serving at least one slot, and quorum is the cluster
* size + 1 */
if (nodeIsSlave(myself) || myself->numslots == 0) return;
/* Request epoch must be >= our currentEpoch.
* Note that it is impossible for it to actually be greater since
* our currentEpoch was updated as a side effect of receiving this
* request, if the request epoch was greater. */
if (requestCurrentEpoch < server.cluster->currentEpoch) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s: reqEpoch (%llu) < curEpoch(%llu)",
node->name,
(unsigned long long) requestCurrentEpoch,
(unsigned long long) server.cluster->currentEpoch);
return;
}
/* I already voted for this epoch? Return ASAP. */
if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s: already voted for epoch %llu",
node->name,
(unsigned long long) server.cluster->currentEpoch);
return;
}
/* Node must be a slave and its master down.
* The master can be non failing if the request is flagged
* with CLUSTERMSG_FLAG0_FORCEACK (manual failover). */
if (nodeIsMaster(node) || master == NULL ||
(!nodeFailed(master) && !force_ack))
{
if (nodeIsMaster(node)) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s: it is a master node",
node->name);
} else if (master == NULL) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s: I don't know its master",
node->name);
} else if (!nodeFailed(master)) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s: its master is up",
node->name);
}
return;
}
/* We did not voted for a slave about this master for two
* times the node timeout. This is not strictly needed for correctness
* of the algorithm but makes the base case more linear. */
if (mstime() - node->slaveof->voted_time < server.cluster_node_timeout * 2)
{
serverLog(LL_WARNING,
"Failover auth denied to %.40s: "
"can't vote about this master before %lld milliseconds",
node->name,
(long long) ((server.cluster_node_timeout*2)-
(mstime() - node->slaveof->voted_time)));
return;
}
/* The slave requesting the vote must have a configEpoch for the claimed
* slots that is >= the one of the masters currently serving the same
* slots in the current configuration. */
for (j = 0; j < CLUSTER_SLOTS; j++) {
if (bitmapTestBit(claimed_slots, j) == 0) continue;
if (server.cluster->slots[j] == NULL ||
server.cluster->slots[j]->configEpoch <= requestConfigEpoch)
{
continue;
}
/* If we reached this point we found a slot that in our current slots
* is served by a master with a greater configEpoch than the one claimed
* by the slave requesting our vote. Refuse to vote for this slave. */
serverLog(LL_WARNING,
"Failover auth denied to %.40s: "
"slot %d epoch (%llu) > reqEpoch (%llu)",
node->name, j,
(unsigned long long) server.cluster->slots[j]->configEpoch,
(unsigned long long) requestConfigEpoch);
return;
}
/* We can vote for this slave. */
server.cluster->lastVoteEpoch = server.cluster->currentEpoch;
node->slaveof->voted_time = mstime();
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
clusterSendFailoverAuth(node);
serverLog(LL_WARNING, "Failover auth granted to %.40s for epoch %llu",
node->name, (unsigned long long) server.cluster->currentEpoch);
}
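The claimed-slot scan above relies on bitmapTestBit(), which is not listed in this article; it is the standard one-bit-per-slot test. A sketch consistent with its use above:
/* Test bit 'pos' in a generic bitmap: slot j lives in byte j/8 at bit j%8.
 * Returns nonzero if the bit is set. */
int bitmapTestBit(unsigned char *bitmap, int pos) {
    int byte = pos / 8;
    int bit = pos & 7;
    return (bitmap[byte] & (1 << bit)) != 0;
}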
The handling in cluster.c#clusterSendFailoverAuth() is simple: it replies to the requesting Slave with a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message, indicating that this Master agrees to vote for it
/* Send a FAILOVER_AUTH_ACK message to the specified node. */
void clusterSendFailoverAuth(clusterNode *node) {
clusterMsg buf[1];
clusterMsg *hdr = (clusterMsg*) buf;
uint32_t totlen;
if (!node->link) return;
clusterBuildMessageHdr(hdr,CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK);
totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
hdr->totlen = htonl(totlen);
clusterSendMessage(node->link,(unsigned char*)buf,totlen);
}
2.3 The Slave Node Tallies the Votes
Similar to the Master-side handling of CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST in the previous section, the Slave-side handling of CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK also lives in cluster.c#clusterProcessPacket()
As the code shows, handling a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message starts with checking that the voting node is a Master serving at least one slot, and that its senderCurrentEpoch is no smaller than failover_auth_epoch, the epoch in which this node started its election. Only when these checks pass is the vote considered valid and the vote counter server.cluster->failover_auth_count incremented
clusterDoBeforeSleep() is then called to flag CLUSTER_TODO_HANDLE_FAILOVER as a task to run before the event loop next goes to sleep (the function itself is sketched after the listing below)
int clusterProcessPacket(clusterLink *link) {
clusterMsg *hdr = (clusterMsg*) link->rcvbuf;
uint32_t totlen = ntohl(hdr->totlen);
uint16_t type = ntohs(hdr->type);
mstime_t now = mstime();
......
else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST ||
type == CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK ||
type == CLUSTERMSG_TYPE_MFSTART)
{
uint32_t explen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
if (totlen != explen) return 1;
......
} else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST) {
if (!sender) return 1; /* We don't know that node. */
clusterSendFailoverAuthIfNeeded(sender,hdr);
} else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK) {
if (!sender) return 1; /* We don't know that node. */
/* We consider this vote only if the sender is a master serving
* a non zero number of slots, and its currentEpoch is greater or
* equal to epoch where this node started the election. */
if (nodeIsMaster(sender) && sender->numslots > 0 &&
senderCurrentEpoch >= server.cluster->failover_auth_epoch)
{
server.cluster->failover_auth_count++;
/* Maybe we reached a quorum here, set a flag to make sure
* we check ASAP. */
clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
}
}
......
}
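For reference, the clusterDoBeforeSleep() calls seen throughout this flow do no work themselves; consistent with its use here, the function simply ORs the given flags into todo_before_sleep (relying on the global server struct, as in the Redis source) so that clusterBeforeSleep() can pick them up later:
/* Record work to perform before the event loop sleeps again. */
void clusterDoBeforeSleep(int flags) {
    server.cluster->todo_before_sleep |= flags;
}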
The CLUSTER_TODO_HANDLE_FAILOVER task is handled in cluster.c#clusterBeforeSleep(), which redis runs before the event loop goes to sleep. The logic is simple: it just calls cluster.c#clusterHandleSlaveFailover() again
void clusterBeforeSleep(void) {
/* Handle failover, this is needed when it is likely that there is already
* the quorum from masters in order to react fast. */
if (server.cluster->todo_before_sleep & CLUSTER_TODO_HANDLE_FAILOVER)
clusterHandleSlaveFailover();
/* Update the cluster state. */
if (server.cluster->todo_before_sleep & CLUSTER_TODO_UPDATE_STATE)
clusterUpdateState();
/* Save the config, possibly using fsync. */
if (server.cluster->todo_before_sleep & CLUSTER_TODO_SAVE_CONFIG) {
int fsync = server.cluster->todo_before_sleep &
CLUSTER_TODO_FSYNC_CONFIG;
clusterSaveConfigOrDie(fsync);
}
/* Reset our flags (not strictly needed since every single function
* called for flags set should be able to clear its flag). */
server.cluster->todo_before_sleep = 0;
}
Picking up from the return after the election request in step 2 of section 2.1: on a later pass, cluster.c#clusterHandleSlaveFailover() mainly checks whether the vote counter server.cluster->failover_auth_count has reached needed_quorum; if it has, clusterFailoverReplaceYourMaster() is called to complete the master/slave switchover
void clusterHandleSlaveFailover(void) {
......
/* Ask for votes if needed. */
if (server.cluster->failover_auth_sent == 0) {
server.cluster->currentEpoch++;
server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
(unsigned long long) server.cluster->currentEpoch);
clusterRequestFailoverAuth();
server.cluster->failover_auth_sent = 1;
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
CLUSTER_TODO_UPDATE_STATE|
CLUSTER_TODO_FSYNC_CONFIG);
return; /* Wait for replies. */
}
/* Check if we reached the quorum. */
if (server.cluster->failover_auth_count >= needed_quorum) {
/* We have the quorum, we can finally failover the master. */
serverLog(LL_WARNING,
"Failover election won: I'm the new master.");
/* Update my configEpoch to the epoch of the election. */
if (myself->configEpoch < server.cluster->failover_auth_epoch) {
myself->configEpoch = server.cluster->failover_auth_epoch;
serverLog(LL_WARNING,
"configEpoch set to %llu after successful failover",
(unsigned long long) myself->configEpoch);
}
/* Take responsibility for the cluster slots. */
clusterFailoverReplaceYourMaster();
} else {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
}
}
The processing of cluster.c#clusterFailoverReplaceYourMaster() is spelled out clearly in its comments, so it is not repeated here. With this, the Redis cluster node election flow is complete
void clusterFailoverReplaceYourMaster(void) {
int j;
clusterNode *oldmaster = myself->slaveof;
if (nodeIsMaster(myself) || oldmaster == NULL) return;
/* 1) Turn this node into a master. */
clusterSetNodeAsMaster(myself);
replicationUnsetMaster();
/* 2) Claim all the slots assigned to our master. */
for (j = 0; j < CLUSTER_SLOTS; j++) {
if (clusterNodeGetSlotBit(oldmaster,j)) {
clusterDelSlot(j);
clusterAddSlot(myself,j);
}
}
/* 3) Update state and save config. */
clusterUpdateState();
clusterSaveConfigOrDie(1);
/* 4) Pong all the other nodes so that they can update the state
* accordingly and detect that we switched to master role. */
clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
/* 5) If there was a manual failover in progress, clear the state. */
resetManualFailover();
}