一、zookeeper选主逻辑主要是根据投票数来定的,具体的逻辑如下:

判断依据:
1、Epoch:leader的任期,任期大的优先级高,其他的节点优先投票给任期大的节点
2、ZXID:zookeeper事务ID,越大表示数据越新,在任期相同时则比较zxid
3、SID:集群中每个节点的唯一编号,当任期、事务id都相同的时候则比较该值,sid越大的优先获得其他节点的投票

二:zookeeper选主源代码分析:

public Vote lookForLeader() throws InterruptedException

选主的入口主要是集中在这儿:

synchronized (this) {
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

其中 logicalclock 表示选举轮数,每一次选举该值都会增加,表示选举轮数增加。我们跟踪进 在这里updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());插入代码片 方法去看一下:

synchronized void updateProposal(long leader, long zxid, long epoch) {
        LOG.debug(
            "Updating proposal: {} (newleader), 0x{} (newzxid), {} (oldleader), 0x{} (oldzxid)",
            leader,
            Long.toHexString(zxid),
            proposedLeader,
            Long.toHexString(proposedZxid));

        proposedLeader = leader;
        proposedZxid = zxid;
        proposedEpoch = epoch;
    }

可以看到,当投票开始的时候,这里所作的操作就是初始化选举信息,
具体的投票操作是在 sendNotifications(); 这里所作的操作就是将自己的选举票发送给自己将要选举的节点,其代码如下:

private void sendNotifications() {
        for (long sid : self.getCurrentAndNextConfigVoters()) {
            QuorumVerifier qv = self.getQuorumVerifier();
            ToSend notmsg = new ToSend(
                ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch,
                qv.toString().getBytes(UTF_8));

            LOG.debug(
                "Sending Notification: {} (n.leader), 0x{} (n.zxid), 0x{} (n.round), {} (recipient),"
                    + " {} (myid), 0x{} (n.peerEpoch) ",
                proposedLeader,
                Long.toHexString(proposedZxid),
                Long.toHexString(logicalclock.get()),
                sid,
                self.getId(),
                Long.toHexString(proposedEpoch));

            sendqueue.offer(notmsg);
        }
    }
offer 方法跟踪查看:
public boolean offer(E e) {
        if (e == null) throw new NullPointerException();
        final AtomicInteger count = this.count;
        if (count.get() == capacity)
            return false;
        final int c;
        final Node<E> node = new Node<E>(e);
        final ReentrantLock putLock = this.putLock;
        putLock.lock();
        try {
            if (count.get() == capacity)
                return false;
            enqueue(node);
            c = count.getAndIncrement();
            if (c + 1 < capacity)
                notFull.signal();
        } finally {
            putLock.unlock();
        }
        if (c == 0)
            signalNotEmpty();
        return true;
    }

offer方法所做的就是将要投票的信息添加到队列当中去,等待调度

looking选举代码:

case LOOKING:
                        if (getInitLastLoggedZxid() == -1) {
                            LOG.debug("Ignoring notification as our zxid is -1");
                            break;
                        }
                        if (n.zxid == -1) {
                            LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                            break;
                        }
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            sendNotifications();

若果当前节点发现自己的选举轮数比对象节点的选举轮数小,则就放弃自己的选票并将自己的票投给epoch id较大的一方。如若不是便判断当前投票是否成功了,如果成功了则更新当前投票结果,倘若不成功,则获取之前数据再次重新投票。另外,当当前的节点的选举轮数小于在记的任期时,则直接结束,因为它的任期小,期间可能因为网络分区的原因导致其数据过于老旧,这个时候只能等待leader通知它进行数据更新。

投票之后将选票信息添加到recvSet中。

// don't care about the version if it's in LOOKING state
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));

如果已经有leader节点被选出来了,但是当前节点不知道的话就会进入下面这段逻辑当中去:

private Vote receivedLeadingNotification(Map<Long, Vote> recvset, Map<Long, Vote> outofelection, SyncedLearnerTracker voteSet, Notification n) {
        /*
        *
        * In a two-node configuration, a recovery nodes cannot locate a leader because of the lack of the majority in the voteset.
        * Therefore, it is the time for Oracle to take place as a tight breaker.
        *
        * */
        Vote result = receivedFollowingNotification(recvset, outofelection, voteSet, n);
        if (result == null) {
            /*
            * Ask Oracle to see if it is okay to follow this leader.
            *
            * We don't need the CheckLeader() because itself cannot be a leader candidate
            * */
            if (self.getQuorumVerifier().getNeedOracle() && !self.getQuorumVerifier().askOracle()) {
                LOG.info("Oracle indicates to follow");
                setPeerState(n.leader, voteSet);
                Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                leaveInstance(endVote); // 结束选举
                return endVote;
            } else {
                LOG.info("Oracle indicates not to follow");
                return null;
            }
        } else {
            return result;
        }
    }