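1. Leader Election
Part (29) covered how the subjective and objective down states are determined, but when discussing the objective-down check it left out how Sentinels synchronize data with each other. That mechanism is the same one used during leader election, so it is covered here.
Sentinels elect a leader using the Raft algorithm, so a brief introduction to Raft's election procedure is needed first. Raft divides servers into three roles: Leader, Follower, and Candidate. A Candidate exists only while an election is in progress. An epoch numbers each election: the first election has epoch 1, the second has epoch 2, and so on. Sentinels follow the same rule when electing: within a single epoch a Sentinel votes for only one leader, namely the first candidate whose request it receives.
Election flow: at the start, every node in Raft is a Follower. Some mechanism then triggers an election. In most implementations this is a heartbeat timeout, but Redis Sentinel uses the objective-down state instead. The node that triggers the election changes its own role to Candidate and sends vote requests to the other nodes. If a Leader is chosen within a certain time, the election ends; otherwise the next round begins.
Sentinel's implementation of Raft differs from a typical one, as the source code below shows in detail. First, picking up from part (29): after the objective-down check, sentinelHandleRedisInstance goes on to execute an if statement.
The condition of that if statement is a call to sentinelStartFailoverIfNeeded, which starts a failover once the master is objectively down. The first step of a failover is to elect a leader among the Sentinels; that leader then carries out the failover of the master and slave servers. If a failover was started, the method returns true, and sentinelAskMasterStateToOtherSentinels is executed next. That method sends requests to the other Sentinels, either to synchronize data or to run the election. Finally there is sentinelFailoverStateMachine, which works like a state machine: it runs a different method depending on which stage the failover has reached.
These three methods are analyzed one by one. We start with sentinelStartFailoverIfNeeded, shown below: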
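The role changes in this flow can be pictured as a small state machine. The following is only a sketch of the generic Raft-style flow, not Sentinel code: the role_t and event_t types and the next_role function are invented for illustration.

```c
/* Hypothetical model of the Raft-style role changes described above.
 * None of these names exist in Redis. */
typedef enum { FOLLOWER, CANDIDATE, LEADER } role_t;

typedef enum {
    EV_TRIGGER,     /* election trigger: heartbeat timeout in Raft, ODOWN in Sentinel */
    EV_WON,         /* collected enough votes in this epoch */
    EV_OTHER_WON,   /* another node won this epoch */
    EV_TIMEOUT      /* nobody won in time: retry in a new epoch */
} event_t;

/* Return the next role; bump *epoch whenever a new election round starts. */
role_t next_role(role_t role, event_t ev, unsigned long long *epoch) {
    switch (role) {
    case FOLLOWER:
        if (ev == EV_TRIGGER) { (*epoch)++; return CANDIDATE; }
        return FOLLOWER;
    case CANDIDATE:
        if (ev == EV_WON) return LEADER;
        if (ev == EV_OTHER_WON) return FOLLOWER;
        if (ev == EV_TIMEOUT) { (*epoch)++; return CANDIDATE; }
        return CANDIDATE;
    case LEADER:
    default:
        return role;
    }
}
```

Starting from FOLLOWER with epoch 0, a trigger followed by a timeout leaves the node a CANDIDATE in epoch 2, which matches the "first election is epoch 1, second is epoch 2" numbering.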
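The order of the three calls can be modeled with stubs. This is a simplified sketch, not the real sentinelHandleRedisInstance: the master_t type, the stub functions, and the ASK_* constants are hypothetical, and the final unforced ask is an assumption based on the later remark that the method is called in a loop.

```c
#include <stdio.h>

/* Stand-ins for the real Sentinel functions; each stub only logs its name.
 * The master_t type and all stub bodies are hypothetical. */
typedef struct { int odown; int failover_in_progress; } master_t;

#define ASK_NO_FLAGS 0
#define ASK_FORCED   (1<<0)

static void check_objectively_down(master_t *m) { (void)m; puts("check ODOWN"); }

static int start_failover_if_needed(master_t *m) {
    if (!m->odown || m->failover_in_progress) return 0;
    m->failover_in_progress = 1;
    puts("start failover");
    return 1;
}

static void ask_other_sentinels(master_t *m, int flags) {
    (void)m;
    printf("ask other sentinels (%s)\n",
           (flags & ASK_FORCED) ? "forced" : "periodic");
}

static void failover_state_machine(master_t *m) {
    (void)m;
    puts("run failover state machine");
}

/* One timer tick for a monitored master, in the order described above.
 * Returns 1 if this tick started a failover. */
int handle_master_tick(master_t *m) {
    int started;
    check_objectively_down(m);
    started = start_failover_if_needed(m);
    if (started) ask_other_sentinels(m, ASK_FORCED);
    failover_state_machine(m);
    /* Assumption: a periodic, unforced ask also runs on every tick, which
     * would be what keeps the down state in sync between Sentinels. */
    ask_other_sentinels(m, ASK_NO_FLAGS);
    return started;
}
```

Calling handle_master_tick twice on a master that is objectively down starts the failover only on the first call; the second call falls through to the state machine.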
/* This function checks if there are the conditions to start the failover,
 * that is:
 *
 * 1) Master must be in ODOWN condition.
 * 2) No failover already in progress.
 * 3) No failover already attempted recently.
 *
 * We still don't know if we'll win the election so it is possible that we
 * start the failover but that we'll not be able to act.
 *
 * Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* We can't failover if the master is not in O_DOWN state. */
    if (!(master->flags & SRI_O_DOWN)) return 0;
    /* Failover already in progress? */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;
    /* Last failover attempt started too little time ago? */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];
            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            serverLog(LL_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }
    sentinelStartFailover(master);
    return 1;
}
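This method is simple: its three if conditions are the three cases in which no failover is started. The first is that the master is not objectively down; the second is that a failover is already in progress; the third is that the previous failover attempt started too recently, that is, less than twice failover_timeout ago.
If none of the three cases applies, a failover needs to start, and the method calls sentinelStartFailover, which sets the failover state to SENTINEL_FAILOVER_STATE_WAIT_START. Its content is as follows: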
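The three guards can be restated as a pure function, which makes the timing rule easy to check by hand. This is a minimal sketch: the DEMO_* flag bits and the may_start_failover name are invented here, and the 180000 ms timeout used in the note below is assumed to be a typical failover-timeout setting.

```c
#include <stdint.h>

/* Illustrative flag bits; the real SRI_* values are defined in sentinel.c. */
#define DEMO_O_DOWN               (1<<0)
#define DEMO_FAILOVER_IN_PROGRESS (1<<1)

/* Return 1 if a failover may start, 0 otherwise. All times are in ms. */
int may_start_failover(int flags, int64_t now,
                       int64_t failover_start_time, int64_t failover_timeout) {
    if (!(flags & DEMO_O_DOWN)) return 0;              /* 1) not ODOWN */
    if (flags & DEMO_FAILOVER_IN_PROGRESS) return 0;   /* 2) already running */
    if (now - failover_start_time < failover_timeout * 2)
        return 0;                                      /* 3) too soon */
    return 1;
}
```

With a timeout of 180000 ms and a previous attempt at time 0, the function refuses at now = 359999 and allows a new attempt at now = 360000, so the Sentinel waits a full six minutes between attempts.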
/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
    serverAssert(master->flags & SRI_MASTER);
    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    master->flags |= SRI_FAILOVER_IN_PROGRESS;
    master->failover_epoch = ++sentinel.current_epoch;
    sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);
    sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}
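This method mainly assigns a few fields. First is failover_state, which represents the state of the failover. A failover consists of several steps, and Redis maps each step to a different state; while the server runs, it executes the method that corresponds to the current state.
Next is flags. Setting SRI_FAILOVER_IN_PROGRESS here ties in with the second if condition in sentinelStartFailoverIfNeeded above and prevents the failover from being started repeatedly.
Then comes failover_epoch, which corresponds to the epoch in the Raft algorithm.
Finally, two timestamps are recorded: failover_start_time and failover_state_change_time. The first is when the failover started (plus a random offset of less than SENTINEL_MAX_DESYNC milliseconds), and the second is when the failover state last changed.
Once this method finishes, sentinelStartFailoverIfNeeded is done as well. If it returns 1, sentinelAskMasterStateToOtherSentinels is executed and communication with the other Sentinels begins. In terms of a Raft election, this is the point where voting starts. A Raft election needs a Candidate, and a Candidate is a converted Follower. The conversion is usually driven by heartbeats: when a Follower notices that the Leader's heartbeat has timed out, it turns itself into a Candidate and starts a new round of election.
In Sentinel the trigger is the objective-down state, and the method that turns a Follower into a Candidate is sentinelStartFailoverIfNeeded, analyzed above. Once it has become a Candidate, the Sentinel immediately sends a vote request to the other Sentinels. This step is implemented by sentinelAskMasterStateToOtherSentinels, whose content is as follows: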
/* If we think the master is down, we start sending
 * SENTINEL IS-MASTER-DOWN-BY-ADDR requests to other sentinels
 * in order to get the replies that allow to reach the quorum
 * needed to mark the master in ODOWN state and trigger a failover. */
#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;
        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }
        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not receive the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;
        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "%s is-master-down-by-addr %s %s %llu %s",
                    sentinelInstanceMapCommand(ri,"SENTINEL"),
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}
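This method is actually quite simple. It first obtains an iterator over all the Sentinels stored for the master (dictGetIterator on master->sentinels), then walks through them in a while loop and calls redisAsyncCommand to send data to each one.
redisAsyncCommand was analyzed in part (27); here the focus is on what it sends. Two arguments matter most: sentinelReceiveIsMasterDownReply, the method that handles the reply, and the command format string "%s is-master-down-by-addr %s %s %llu %s".
First, the command that is sent. The first %s is the result of sentinelInstanceMapCommand(ri,"SENTINEL"). This method was mentioned earlier; it exists to handle renamed commands, and if the command has not been renamed its value is SENTINEL. The second %s is the master's IP, the third %s is the master's port, and the fourth placeholder (%llu) is the current election epoch. The last %s is special and has two cases: once a failover has started, the Sentinel sends its own server ID, sentinel.myid; otherwise it sends "*". The master->failover_state field tested in that expression was discussed above for sentinelStartFailover, which sets it to SENTINEL_FAILOVER_STATE_WAIT_START. That constant's actual value is 1 and stands for the Sentinel election stage, whereas SENTINEL_FAILOVER_STATE_NONE, the value it is compared against, is 0 and means that no failover is in progress. The failover has further stages with their own constants, all of which are greater than 0.
To sum up, the command a Sentinel sends at this point looks like this:
SENTINEL is-master-down-by-addr <ip> <port> <current_epoch> <runid>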
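To make the two forms of the last argument concrete, the sketch below builds the request text with snprintf. It assumes the command has not been renamed; format_is_master_down and DEMO_FAILOVER_STATE_NONE are names made up for this example, and the real code passes the format string to redisAsyncCommand rather than building a buffer.

```c
#include <stdio.h>

#define DEMO_FAILOVER_STATE_NONE 0   /* mirrors SENTINEL_FAILOVER_STATE_NONE */

/* Write the request text into buf; returns whatever snprintf() returns.
 * The runid argument is myid once a failover has started, "*" otherwise. */
int format_is_master_down(char *buf, size_t len, const char *ip, int port,
                          unsigned long long current_epoch,
                          int failover_state, const char *myid) {
    const char *runid =
        (failover_state > DEMO_FAILOVER_STATE_NONE) ? myid : "*";
    return snprintf(buf, len, "SENTINEL is-master-down-by-addr %s %d %llu %s",
                    ip, port, current_epoch, runid);
}
```

With failover_state 0 the buffer ends in "*", which is the plain state query; with failover_state 1 it ends in the sender's own ID, which turns the same command into a vote request.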
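Note that this command is sent to other Sentinel servers. Part (26) analyzed how a Sentinel starts: it actually uses the startup path of the Redis database server, so it processes commands the same way Redis does. The difference lies in which commands are available. As mentioned in part (26), initSentinel is called during Sentinel startup. The method is as follows: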
/* Perform the Sentinel mode initialization. */
void initSentinel(void) {
    unsigned int j;
    /* Remove usual Redis commands from the command table, then just add
     * the SENTINEL command. */
    dictEmpty(server.commands,NULL);
    for (j = 0; j < sizeof(sentinelcmds)/sizeof(sentinelcmds[0]); j++) {
        int retval;
        struct redisCommand *cmd = sentinelcmds+j;
        retval = dictAdd(server.commands, sdsnew(cmd->name), cmd);
        serverAssert(retval == DICT_OK);
    }
    /* Initialize various data structures. */
    sentinel.current_epoch = 0;
    sentinel.masters = dictCreate(&instancesDictType,NULL);
    sentinel.tilt = 0;
    sentinel.tilt_start_time = 0;
    sentinel.previous_time = mstime();
    sentinel.running_scripts = 0;
    sentinel.scripts_queue = listCreate();
    sentinel.announce_ip = NULL;
    sentinel.announce_port = 0;
    sentinel.simfailure_flags = SENTINEL_SIMFAILURE_NONE;
    sentinel.deny_scripts_reconfig = SENTINEL_DEFAULT_DENY_SCRIPTS_RECONFIG;
    memset(sentinel.myid,0,sizeof(sentinel.myid));
}
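The first part of the method, the dictEmpty call and the for loop, removes the commands that were registered originally and then iterates over an array named sentinelcmds, adding each entry as a new command. The content of sentinelcmds is as follows: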
struct redisCommand sentinelcmds[] = {
    {"ping",pingCommand,1,"",0,NULL,0,0,0,0,0},
    {"sentinel",sentinelCommand,-2,"",0,NULL,0,0,0,0,0},
    {"subscribe",subscribeCommand,-2,"",0,NULL,0,0,0,0,0},
    {"unsubscribe",unsubscribeCommand,-1,"",0,NULL,0,0,0,0,0},
    {"psubscribe",psubscribeCommand,-2,"",0,NULL,0,0,0,0,0},
    {"punsubscribe",punsubscribeCommand,-1,"",0,NULL,0,0,0,0,0},
    {"publish",sentinelPublishCommand,3,"",0,NULL,0,0,0,0,0},
    {"info",sentinelInfoCommand,-1,"",0,NULL,0,0,0,0,0},
    {"role",sentinelRoleCommand,1,"l",0,NULL,0,0,0,0,0},
    {"client",clientCommand,-2,"rs",0,NULL,0,0,0,0,0},
    {"shutdown",shutdownCommand,-1,"",0,NULL,0,0,0,0,0},
    {"auth",authCommand,2,"sltF",0,NULL,0,0,0,0,0}
};
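The effect of swapping the command table can be shown with a cut-down lookup. This sketch uses a linear scan over a two-entry array instead of the dict that Redis uses, and every demo_* name is hypothetical; it only illustrates that a command absent from the table simply cannot be found.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

typedef int (*demo_handler)(void);

static int demo_ping(void)     { return 1; }
static int demo_sentinel(void) { return 2; }

/* A cut-down table in the spirit of sentinelcmds; names and handlers are
 * hypothetical. */
static const struct { const char *name; demo_handler proc; } demo_cmds[] = {
    {"ping", demo_ping},
    {"sentinel", demo_sentinel},
};

/* Case-insensitive lookup; NULL means the command is not available. */
demo_handler demo_lookup(const char *name) {
    size_t j;
    for (j = 0; j < sizeof(demo_cmds)/sizeof(demo_cmds[0]); j++)
        if (!strcasecmp(name, demo_cmds[j].name)) return demo_cmds[j].proc;
    return NULL;
}
```

Looking up "SENTINEL" returns the sentinel handler regardless of case, while an ordinary data command such as "get" returns NULL because it was never added to the table.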
Here we can see that the SENTINEL command is handled by a method named sentinelCommand. The part of that method that handles the is-master-down-by-addr subcommand is as follows:
void sentinelCommand(client *c) {
    if (!strcasecmp(c->argv[1]->ptr,"masters")) {
        ...
    } else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
        /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
         *
         * Arguments:
         *
         * ip and port are the ip and port of the master we want to be
         * checked by Sentinel. Note that the command will not check by
         * name but just by master, in theory different Sentinels may monitor
         * different masters with the same name.
         *
         * current-epoch is needed in order to understand if we are allowed
         * to vote for a failover leader or not. Each Sentinel can vote just
         * one time per epoch.
         *
         * runid is "*" if we are not seeking for a vote from the Sentinel
         * in order to elect the failover leader. Otherwise it is set to the
         * runid we want the Sentinel to vote if it did not already voted.
         */
        sentinelRedisInstance *ri;
        long long req_epoch;
        uint64_t leader_epoch = 0;
        char *leader = NULL;
        long port;
        int isdown = 0;
        if (c->argc != 6) goto numargserr;
        if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
            getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
                                                              != C_OK)
            return;
        ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
            c->argv[2]->ptr,port,NULL);
        /* It exists? Is actually a master? Is subjectively down? It's down.
         * Note: if we are in tilt mode we always reply with "0". */
        if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
                                    (ri->flags & SRI_MASTER))
            isdown = 1;
        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }
        /* Reply with a three-elements multi-bulk reply:
         * down state, leader, vote epoch. */
        addReplyMultiBulkLen(c,3);
        addReply(c, isdown ? shared.cone : shared.czero);
        addReplyBulkCString(c, leader ? leader : "*");
        addReplyLongLong(c, (long long)leader_epoch);
        if (leader) sdsfree(leader);
    } else if (!strcasecmp(c->argv[1]->ptr,"reset")) {
        ...
}
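As described above, the last argument of the command, runid, takes one of two values: the runid sent during a Sentinel election, or "*" sent at all other times. When "*" is sent, the command's main purpose is to synchronize the subjective-down state among Sentinels. The data synchronization mentioned in the earlier analysis of the objective-down check happens here. Because this method is called in a loop while the server runs, the synchronization is effectively continuous.
In the code above, the first part (from the argc check through getSentinelRedisInstanceByAddrAndRunID) parses the command arguments and uses them to find the corresponding server instance (ri). Next comes the check of this Sentinel's own subjective-down judgment, which, as you can see, simply tests the flags on ri. The if statement that follows handles the election case and casts the vote through sentinelVoteLeader. Finally, the addReply calls return data to the server that sent the command. Three values are returned: the subjective-down state (0 or 1), the result of the vote, and the epoch of the vote.
The sentinelVoteLeader method called to cast the vote is as follows: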
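The two decisions the handler makes, whether to report the master as down and whether the request asks for a vote, can be separated from the reply plumbing. The sketch below is a simplified model with invented names (demo_decision, demo_decide); it collapses the ri lookup and the SRI_MASTER check into a single master_known flag.

```c
#include <string.h>

typedef struct {
    int isdown;       /* first reply element: 1 if this Sentinel sees SDOWN */
    int wants_vote;   /* 1 if the request carried a runid instead of "*" */
} demo_decision;

/* master_known stands for "ri was found and is a master". */
demo_decision demo_decide(int tilt, int master_known, int master_sdown,
                          const char *req_runid) {
    demo_decision d;
    /* In tilt mode the down state is always reported as 0. */
    d.isdown = (!tilt && master_known && master_sdown) ? 1 : 0;
    /* Any runid other than "*" is a request for a vote. */
    d.wants_vote = (master_known && strcmp(req_runid, "*") != 0) ? 1 : 0;
    return d;
}
```

One detail this makes visible: tilt mode suppresses the down report but, in the handler as shown, does not prevent a vote from being cast.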
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }
    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }
    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}
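Voting here follows the Raft approach: within one epoch, the vote goes to the first candidate whose request is received, and only one vote is cast per epoch.
First is the opening if statement, which compares the epoch in the request with the epoch the Sentinel has stored. If the requested epoch is larger, a new round of election has started, so the Sentinel needs to update its own epoch and vote. This first if statement only updates the epoch (sentinel.current_epoch = req_epoch). The second if statement is where the vote is actually cast, and the way it votes is simple: it records the runid (master->leader = sdsnew(req_runid)). Finally the method returns the result of the vote, that is, a copy of master->leader.
Having covered what happens on the receiving side, we return to the Sentinel that sent the command and look at how it handles the reply.
As mentioned above, the sender registered sentinelReceiveIsMasterDownReply to handle the reply. The method is as follows: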
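The one-vote-per-epoch rule is easiest to see by replaying a few requests against a stripped-down voter. The sketch below keeps the two if conditions of sentinelVoteLeader but drops the config flush, events, and delay handling; demo_voter and demo_vote are hypothetical names, and a fixed-size buffer stands in for the sds string.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t current_epoch;   /* plays the role of sentinel.current_epoch */
    uint64_t leader_epoch;    /* plays the role of master->leader_epoch */
    char leader[64];          /* voted runid, "" if no vote yet */
} demo_voter;

/* Handle one vote request and return the runid this voter now supports. */
const char *demo_vote(demo_voter *v, uint64_t req_epoch, const char *req_runid) {
    /* A larger epoch in the request means a new election round. */
    if (req_epoch > v->current_epoch) v->current_epoch = req_epoch;
    /* Vote only if no vote was cast yet for this epoch. */
    if (v->leader_epoch < req_epoch && v->current_epoch <= req_epoch) {
        strncpy(v->leader, req_runid, sizeof(v->leader) - 1);
        v->leader[sizeof(v->leader) - 1] = '\0';
        v->leader_epoch = v->current_epoch;
    }
    return v->leader;
}
```

Starting from an empty voter, a request from "A" in epoch 1 wins the vote, a later request from "B" in the same epoch still gets "A" back, and only a request with epoch 2 can move the vote to "B". A stale request with epoch 1 arriving after that changes nothing.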
void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = privdata;
    instanceLink *link = c->data;
    redisReply *r;
    if (!reply || !link) return;
    link->pending_commands--;
    r = reply;
    /* Ignore every error or unexpected reply.
     * Note that if the command returns an error for any reason we'll
     * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        if (r->element[0]->integer == 1) {
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        if (strcmp(r->element[1]->str,"*")) {
            /* If the runid in the reply is not "*" the Sentinel actually
             * replied with a vote. */
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                serverLog(LL_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            ri->leader = sdsnew(r->element[1]->str);
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}
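From the analysis above we know the reply has three values. The first is the replying Sentinel's subjective-down state; it is handled by the if statement on r->element[0], which mainly sets or clears SRI_MASTER_DOWN. The second is the result of the vote: as the earlier code shows, it is "*" when no election is under way and the runid of the chosen server during an election. When a runid is returned, the body of the if (strcmp(r->element[1]->str,"*")) statement runs, which again is mostly assignment: it stores the runid in ri->leader and the third value, the vote epoch, in ri->leader_epoch.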
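The bookkeeping on the sending side can be sketched without hiredis by passing the three reply values directly. Everything below is illustrative: demo_peer, demo_apply_reply, and the DEMO_MASTER_DOWN bit are invented, and the type checks on the reply array are assumed to have passed already.

```c
#include <string.h>

#define DEMO_MASTER_DOWN (1<<0)   /* illustrative bit, not the real SRI_MASTER_DOWN */

typedef struct {
    int flags;
    char leader[64];                  /* "" until this peer reports a vote */
    unsigned long long leader_epoch;
} demo_peer;

/* Record what one peer Sentinel answered: down state, vote, vote epoch. */
void demo_apply_reply(demo_peer *p, long long down, const char *leader,
                      long long vote_epoch) {
    if (down == 1) p->flags |= DEMO_MASTER_DOWN;
    else p->flags &= ~DEMO_MASTER_DOWN;
    /* "*" means the peer did not vote, so the stored vote is left alone. */
    if (strcmp(leader, "*") != 0) {
        strncpy(p->leader, leader, sizeof(p->leader) - 1);
        p->leader[sizeof(p->leader) - 1] = '\0';
        p->leader_epoch = (unsigned long long)vote_epoch;
    }
}
```

A reply of (1, "*", 0) only sets the down flag, while a reply of (0, "abc", 7) clears the flag and records that the peer voted for "abc" in epoch 7.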