1. Concepts
From the official website:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
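Explanation
In other words, ZooKeeper is a centralized coordination service: it maintains configuration information, provides naming, and offers distributed synchronization and group services.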
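2. Workflow
3. ZAB protocol
ZAB is a protocol for keeping transactional data consistent across a distributed system.
- Full name: ZooKeeper Atomic Broadcast
- It has two parts: crash recovery and atomic broadcast
- Only one server, the Leader, accepts transaction (write) requests; it turns each one into a proposal
- The Leader broadcasts each proposal to all Followers and commits it once a majority has acknowledged it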
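The broadcast step can be pictured as the Leader counting acknowledgements per proposal. The following is only a minimal sketch under that assumption: ProposalTracker and its methods are hypothetical names for illustration, not classes from the ZooKeeper code base, and networking, ordering, and recovery are left out entirely.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of one proposal in the broadcast phase (hypothetical class, not ZooKeeper code). */
final class ProposalTracker {
    private final int votingMembers;               // Leader plus Followers
    private final long zxid;                       // transaction id carried by the proposal
    private final Set<Long> acks = new HashSet<>();

    ProposalTracker(int votingMembers, long zxid) {
        this.votingMembers = votingMembers;
        this.zxid = zxid;
    }

    /** Records an ACK from server sid; returns true once a strict majority has acknowledged. */
    boolean ack(long sid) {
        acks.add(sid);                             // a duplicate ACK from the same server counts once
        return hasQuorum();
    }

    boolean hasQuorum() {
        return acks.size() > votingMembers / 2;
    }

    long zxid() {
        return zxid;
    }
}
```

With five voting members the proposal becomes committable on the third distinct ACK, which is why the ensemble keeps working while a minority of servers is down.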
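4. Data storage
4.1 Data nodes
The part of ZooKeeper I was most curious about is data storage. Internally the data is kept in a tree. Each node of the tree is called a ZNode, each ZNode is uniquely identified by its path, and each ZNode can hold a small amount of data (1 MB by default, adjustable through configuration).
Unique path identifier
For example, the command create /config/db 'db' creates a node.
Here /config/db is the path that uniquely identifies the node, and 'db' is the data stored in it.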
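4.2 Node types
- Ephemeral node (EPHEMERAL): bound to the client session that created it and deleted automatically when that session ends; an ephemeral node cannot have children. (EPHEMERAL_SEQUENTIAL is an ephemeral node whose name also carries a sequence number.)
- Persistent node (PERSISTENT): exists until it is explicitly deleted. (PERSISTENT_SEQUENTIAL is a persistent node whose name also carries a sequence number.)
One question to note here: how do you choose between ephemeral and persistent, and between sequential and non-sequential?
My personal understanding:
Ephemeral vs. persistent depends on the business system and the deployment. If the data changes rarely, is important, and the zk servers are stable, choose persistent (in my experience ephemeral is the more common choice).
Sequential vs. non-sequential: if the data is involved in locking, or is created and modified at high frequency, use sequential nodes, because the sequence number is what distributed locks and transaction ordering logic rely on.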
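To make the link between sequence numbers and locks concrete, here is a rough sketch of the ordering step of the usual lock recipe: every contender creates an ephemeral sequential child, the lowest sequence number holds the lock, and each waiter watches the node just ahead of it. SequentialLockOrder and the lock- prefix are hypothetical; the only fact taken from ZooKeeper is that a sequential node's name ends in a 10-digit, zero-padded counter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Ordering helper for a lock built on sequential znodes (hypothetical, client-side only). */
final class SequentialLockOrder {
    /** Parses the 10-digit counter that ZooKeeper appends to sequential node names. */
    static int sequenceOf(String nodeName) {
        return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
    }

    /** Returns the children sorted by sequence number, lowest first. */
    static List<String> sorted(List<String> children) {
        List<String> copy = new ArrayList<>(children);
        copy.sort(Comparator.comparingInt(SequentialLockOrder::sequenceOf));
        return copy;
    }

    /**
     * Returns null if mine has the lowest sequence number (it holds the lock),
     * otherwise the name of the node directly ahead of it, which is the one to watch.
     */
    static String predecessorOf(String mine, List<String> children) {
        List<String> ordered = sorted(children);
        int index = ordered.indexOf(mine);
        if (index < 0) {
            throw new IllegalArgumentException("not a child: " + mine);
        }
        return index == 0 ? null : ordered.get(index - 1);
    }
}
```

Watching only the predecessor, instead of the parent node, means a release wakes up a single waiter.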
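4.3 Node access control (ACL)
An ACL entry has three parts: [scheme]:[id]:[permissions].
Each node has its own ACL; it is not inherited by child nodes!
scheme values
- world: anyone (the id is anyone)
- auth: any user already authenticated on the current session; no id is required
- digest: username and password authentication (the id is user:digest, where digest is the SHA-1 hash of user:password, Base64-encoded)
- ip: authentication by IP address or address range
id values
Identifies who the entry applies to. How it is parsed depends on the scheme: for example a username, a username with a password digest, or an IP address.
permission values
- create: create child nodes
- delete: delete child nodes
- write: write data to the znode
- read: read data from the znode and list its children
- admin: set ACLs
The letters cdwra are normally used for create, delete, write, read, and admin respectively.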
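Internally the five permissions are bit flags that get OR-ed together. The sketch below turns a letter string such as cdwra into that bitmask. AclPerms is a hypothetical helper; the numeric values mirror the constants in ZooDefs.Perms as I understand them (READ = 1, WRITE = 2, CREATE = 4, DELETE = 8, ADMIN = 16).

```java
/** Converts permission letters into a bitmask (hypothetical helper, not part of ZooKeeper). */
final class AclPerms {
    static final int READ = 1;
    static final int WRITE = 1 << 1;
    static final int CREATE = 1 << 2;
    static final int DELETE = 1 << 3;
    static final int ADMIN = 1 << 4;

    /** Parses a string such as "cdwra"; letter order does not matter. */
    static int parse(String letters) {
        int perms = 0;
        for (char c : letters.toCharArray()) {
            switch (c) {
                case 'r': perms |= READ; break;
                case 'w': perms |= WRITE; break;
                case 'c': perms |= CREATE; break;
                case 'd': perms |= DELETE; break;
                case 'a': perms |= ADMIN; break;
                default:
                    throw new IllegalArgumentException("unknown permission letter: " + c);
            }
        }
        return perms;
    }

    /** True if the mask grants the given permission bit. */
    static boolean allows(int perms, int permission) {
        return (perms & permission) != 0;
    }
}
```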
# Create the node
create /zookeeper/test 'test'
# Set permissions (world)
setAcl /zookeeper/test world:anyone:cdwra
# Set permissions (auth)
# Authenticate the users first
addauth digest user1:123456
addauth digest user2:123456
# Grant permissions
setAcl /zookeeper/test auth:user1:cdwra
# Grant permissions to all authenticated users
setAcl /zookeeper/test auth::cdwra
# Set permissions (digest)
setAcl /zookeeper/test digest:user1:<digest>:cdwra
## The digest can be generated with the following command
echo -n <user>:<password> | openssl dgst -binary -sha1 | openssl base64
# Set permissions (ip)
setAcl /zookeeper/test ip:192.168.0.1:cdwra
setAcl /zookeeper/test ip:192.168.0.1/16:cdwra
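The openssl pipeline above can also be reproduced in plain Java when the id for a digest ACL has to be built in code. This is a sketch using only the JDK; DigestAclId is a hypothetical name. ZooKeeper itself ships DigestAuthenticationProvider.generateDigest for the same purpose, which is what I would reach for when the server jar is on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Builds the user:digest id used by the digest scheme (hypothetical helper). */
final class DigestAclId {
    /** Returns user + ":" + base64(sha1(user + ":" + password)). */
    static String generate(String user, String password) {
        try {
            byte[] input = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
            byte[] sha1 = MessageDigest.getInstance("SHA-1").digest(input);
            return user + ":" + Base64.getEncoder().encodeToString(sha1);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required JDK algorithm, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
    }
}
```

The returned string goes into the command as setAcl /zookeeper/test digest:<returned value>:cdwra. Note that the hash covers user:password, not the password alone.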
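4.4 Data objects
ZooKeeper version: 3.4.14
zk keeps its in-memory data in a DataTree, and DataTree in turn stores the nodes in a ConcurrentHashMap:
the key is a String, the path that uniquely identifies the ZNode;
the value is a DataNode, the smallest unit of data storage.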
/**
 * The class is large; the rest is omitted here.
 */
public class DataTree {
    private static final Logger LOG = LoggerFactory.getLogger(DataTree.class);
    /**
     * This hashtable provides a fast lookup to the datanodes. The tree is the
     * source of truth and is where all the locking occurs
     */
    private final ConcurrentHashMap<String, DataNode> nodes =
        new ConcurrentHashMap<String, DataNode>();
    private final WatchManager dataWatches = new WatchManager();
    private final WatchManager childWatches = new WatchManager();
    /** the root of zookeeper tree */
    private static final String rootZookeeper = "/";
    /** the zookeeper nodes that acts as the management and status node **/
    private static final String procZookeeper = Quotas.procZookeeper;
    /** this will be the string thats stored as a child of root */
    private static final String procChildZookeeper = procZookeeper.substring(1);
    /**
     * the zookeeper quota node that acts as the quota management node for
     * zookeeper
     */
    private static final String quotaZookeeper = Quotas.quotaZookeeper;
    /** this will be the string thats stored as a child of /zookeeper */
    private static final String quotaChildZookeeper = quotaZookeeper
            .substring(procZookeeper.length() + 1);
    /**
     * the path trie that keeps track fo the quota nodes in this datatree
     */
    private final PathTrie pTrie = new PathTrie();
    /**
     * This hashtable lists the paths of the ephemeral nodes of a session.
     */
    private final Map<Long, HashSet<String>> ephemerals =
        new ConcurrentHashMap<Long, HashSet<String>>();
    private final ReferenceCountedACLCache aclCache = new ReferenceCountedACLCache();
    ...
}
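The fields of DataNode are summarized here, followed by its source:
DataNode parent: reference to the parent node
byte data[]: the stored data
Long acl: a key into the DataTree's ACL cache that identifies this node's ACL list; each node has its own ACL, independent of its parent and children (see 4.3)
StatPersisted stat: the node state that is persisted to disk (StatPersisted is worth reading on your own; it has about ten fields that record state such as transaction ids, timestamps, and version counters)
Set<String> children: the set of child node names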
package org.apache.zookeeper.server;
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.Collections;
import org.apache.jute.InputArchive;
import org.apache.jute.OutputArchive;
import org.apache.jute.Record;
import org.apache.zookeeper.data.Stat;
import org.apache.zookeeper.data.StatPersisted;
/**
 * This class contains the data for a node in the data tree.
 * <p>
 * A data node contains a reference to its parent, a byte array as its data, an
 * array of ACLs, a stat object, and a set of its children's paths.
 *
 */
@SuppressFBWarnings("EI_EXPOSE_REP2")
public class DataNode implements Record {
    /** the parent of this datanode */
    DataNode parent;
    /** the data for this datanode */
    byte data[];
    /**
     * the acl map long for this datanode. the datatree has the map
     */
    Long acl;
    /**
     * the stat for this node that is persisted to disk.
     */
    public StatPersisted stat;
    /**
     * the list of children for this node. note that the list of children string
     * does not contain the parent path -- just the last part of the path. This
     * should be synchronized on except deserializing (for speed up issues).
     */
    private Set<String> children = null;
    private static final Set<String> EMPTY_SET = Collections.emptySet();
    /**
     * default constructor for the datanode
     */
    DataNode() {
        // default constructor
    }
    /**
     * create a DataNode with parent, data, acls and stat
     *
     * @param parent
     *            the parent of this DataNode
     * @param data
     *            the data to be set
     * @param acl
     *            the acls for this node
     * @param stat
     *            the stat for this node.
     */
    public DataNode(DataNode parent, byte data[], Long acl, StatPersisted stat) {
        this.parent = parent;
        this.data = data;
        this.acl = acl;
        this.stat = stat;
    }
    /**
     * Method that inserts a child into the children set
     *
     * @param child
     *            to be inserted
     * @return true if this set did not already contain the specified element
     */
    public synchronized boolean addChild(String child) {
        if (children == null) {
            // let's be conservative on the typical number of children
            children = new HashSet<String>(8);
        }
        return children.add(child);
    }
    /**
     * Method that removes a child from the children set
     *
     * @param child
     * @return true if this set contained the specified element
     */
    public synchronized boolean removeChild(String child) {
        if (children == null) {
            return false;
        }
        return children.remove(child);
    }
    /**
     * convenience method for setting the children for this datanode
     *
     * @param children
     */
    public synchronized void setChildren(HashSet<String> children) {
        this.children = children;
    }
    /**
     * convenience methods to get the children
     *
     * @return the children of this datanode
     */
    public synchronized Set<String> getChildren() {
        if (children == null) {
            return EMPTY_SET;
        }
        return Collections.unmodifiableSet(children);
    }
    synchronized public void copyStat(Stat to) {
        to.setAversion(stat.getAversion());
        to.setCtime(stat.getCtime());
        to.setCzxid(stat.getCzxid());
        to.setMtime(stat.getMtime());
        to.setMzxid(stat.getMzxid());
        to.setPzxid(stat.getPzxid());
        to.setVersion(stat.getVersion());
        to.setEphemeralOwner(stat.getEphemeralOwner());
        to.setDataLength(data == null ? 0 : data.length);
        int numChildren = 0;
        if (this.children != null) {
            numChildren = children.size();
        }
        // when we do the Cversion we need to translate from the count of the creates
        // to the count of the changes (v3 semantics)
        // for every create there is a delete except for the children still present
        to.setCversion(stat.getCversion()*2 - numChildren);
        to.setNumChildren(numChildren);
    }
    synchronized public void deserialize(InputArchive archive, String tag)
            throws IOException {
        archive.startRecord("node");
        data = archive.readBuffer("data");
        acl = archive.readLong("acl");
        stat = new StatPersisted();
        stat.deserialize(archive, "statpersisted");
        archive.endRecord("node");
    }
    synchronized public void serialize(OutputArchive archive, String tag)
            throws IOException {
        archive.startRecord(this, "node");
        archive.writeBuffer(data, "data");
        archive.writeLong(acl, "acl");
        stat.serialize(archive, "statpersisted");
        archive.endRecord(this, "node");
    }
}
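The two classes above boil down to one idea: a flat map from full path to node for fast lookup, with each node also remembering the names of its children. The toy below is my own simplification of that idea, assuming nothing beyond the JDK. ToyDataTree and Node are hypothetical names, and stats, ACLs, watches, and persistence are all left out.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, heavily simplified stand-in for DataTree and DataNode. */
final class ToyDataTree {
    /** One znode: its payload plus the names (not full paths) of its children. */
    static final class Node {
        final byte[] data;
        final Set<String> children = new HashSet<>();

        Node(byte[] data) {
            this.data = data;
        }
    }

    // Full path -> node, so a lookup never has to walk the tree.
    private final Map<String, Node> nodes = new ConcurrentHashMap<>();

    ToyDataTree() {
        nodes.put("/", new Node(new byte[0]));
    }

    /** Creates a node; the parent must already exist and the path must be new. */
    synchronized void create(String path, byte[] data) {
        if (!path.startsWith("/") || path.length() < 2 || path.endsWith("/")) {
            throw new IllegalArgumentException("invalid path: " + path);
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash == 0 ? "/" : path.substring(0, slash);
        Node parent = nodes.get(parentPath);
        if (parent == null) {
            throw new IllegalStateException("no parent: " + parentPath);
        }
        if (nodes.putIfAbsent(path, new Node(data)) != null) {
            throw new IllegalStateException("node exists: " + path);
        }
        parent.children.add(path.substring(slash + 1));
    }

    /** Returns the node's data, or null if the path does not exist. */
    byte[] getData(String path) {
        Node node = nodes.get(path);
        return node == null ? null : node.data;
    }

    /** Returns a read-only snapshot of the child names, empty if the path does not exist. */
    synchronized Set<String> getChildren(String path) {
        Node node = nodes.get(path);
        if (node == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(new HashSet<>(node.children));
    }
}
```

As in the real DataNode, the child set stores only the last path segment, so /config/db is recorded under /config as db.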
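5. Watchers
On the client side, zk keeps registered watchers in ZKWatchManager: the dataWatches, existWatches, and childWatches maps hold the watchers for each path, and defaultWatcher holds the default watcher, normally the one passed to the ZooKeeper constructor. The source is below.
The materialize method returns the set of watchers that should be notified for a given event and removes them from those maps.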
/**
 * Manage watchers & handle events generated by the ClientCnxn object.
 *
 * We are implementing this as a nested class of ZooKeeper so that
 * the public methods will not be exposed as part of the ZooKeeper client
 * API.
 */
private static class ZKWatchManager implements ClientWatchManager {
    private final Map<String, Set<Watcher>> dataWatches =
        new HashMap<String, Set<Watcher>>();
    private final Map<String, Set<Watcher>> existWatches =
        new HashMap<String, Set<Watcher>>();
    private final Map<String, Set<Watcher>> childWatches =
        new HashMap<String, Set<Watcher>>();
    private volatile Watcher defaultWatcher;
    final private void addTo(Set<Watcher> from, Set<Watcher> to) {
        if (from != null) {
            to.addAll(from);
        }
    }
    /* (non-Javadoc)
     * @see org.apache.zookeeper.ClientWatchManager#materialize(Event.KeeperState,
     *                                                        Event.EventType, java.lang.String)
     */
    @Override
    public Set<Watcher> materialize(Watcher.Event.KeeperState state,
                                    Watcher.Event.EventType type,
                                    String clientPath)
    {
        Set<Watcher> result = new HashSet<Watcher>();
        switch (type) {
        case None:
            result.add(defaultWatcher);
            boolean clear = ClientCnxn.getDisableAutoResetWatch() &&
                    state != Watcher.Event.KeeperState.SyncConnected;
            synchronized(dataWatches) {
                for(Set<Watcher> ws: dataWatches.values()) {
                    result.addAll(ws);
                }
                if (clear) {
                    dataWatches.clear();
                }
            }
            synchronized(existWatches) {
                for(Set<Watcher> ws: existWatches.values()) {
                    result.addAll(ws);
                }
                if (clear) {
                    existWatches.clear();
                }
            }
            synchronized(childWatches) {
                for(Set<Watcher> ws: childWatches.values()) {
                    result.addAll(ws);
                }
                if (clear) {
                    childWatches.clear();
                }
            }
            return result;
        case NodeDataChanged:
        case NodeCreated:
            synchronized (dataWatches) {
                addTo(dataWatches.remove(clientPath), result);
            }
            synchronized (existWatches) {
                addTo(existWatches.remove(clientPath), result);
            }
            break;
        case NodeChildrenChanged:
            synchronized (childWatches) {
                addTo(childWatches.remove(clientPath), result);
            }
            break;
        case NodeDeleted:
            synchronized (dataWatches) {
                addTo(dataWatches.remove(clientPath), result);
            }
            // XXX This shouldn't be needed, but just in case
            synchronized (existWatches) {
                Set<Watcher> list = existWatches.remove(clientPath);
                if (list != null) {
                    addTo(list, result);
                    LOG.warn("We are triggering an exists watch for delete! Shouldn't happen!");
                }
            }
            synchronized (childWatches) {
                addTo(childWatches.remove(clientPath), result);
            }
            break;
        default:
            String msg = "Unhandled watch event type " + type
                + " with state " + state + " on path " + clientPath;
            LOG.error(msg);
            throw new RuntimeException(msg);
        }
        return result;
    }
}
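Watchers can be registered through three methods of the ZooKeeper client: getData, exists, and getChildren. The registration logic is much the same in each, so take getData as the example:
- new DataWatchRegistration(watcher, clientPath) wraps the watcher
- cnxn.submitRequest(h, request, response, wcb) sends the request to the server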
public byte[] getData(final String path, Watcher watcher, Stat stat)
    throws KeeperException, InterruptedException
{
    final String clientPath = path;
    PathUtils.validatePath(clientPath);
    // the watch contains the un-chroot path
    WatchRegistration wcb = null;
    if (watcher != null) {
        wcb = new DataWatchRegistration(watcher, clientPath);
    }
    final String serverPath = prependChroot(clientPath);
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.getData);
    GetDataRequest request = new GetDataRequest();
    request.setPath(serverPath);
    request.setWatch(watcher != null);
    GetDataResponse response = new GetDataResponse();
    ReplyHeader r = cnxn.submitRequest(h, request, response, wcb);
    if (r.getErr() != 0) {
        throw KeeperException.create(KeeperException.Code.get(r.getErr()),
                clientPath);
    }
    if (stat != null) {
        DataTree.copyStat(response.getStat(), stat);
    }
    return response.getData();
}
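What happens after a Watcher is registered (how it is triggered and called back, for instance) is left for interested readers to explore.
Some characteristics of Watchers:
- A Watcher is removed once it has been triggered and must be registered again to receive further events
- Watchers run in order: when several are triggered, their callbacks execute one after another
- When a Watcher is triggered, the server actively pushes the change notification to the client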
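The one-shot behaviour follows directly from materialize calling remove on the map: the watcher set is taken out at the moment it fires. The toy registry below imitates just that part. OneShotWatches is a hypothetical class and uses a plain Consumer<String> in place of ZooKeeper's Watcher interface, so treat it as an illustration of the pattern only.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Toy one-shot watch registry (hypothetical; Consumer<String> stands in for Watcher). */
final class OneShotWatches {
    // Path -> watchers, kept in registration order so callbacks run in that order.
    private final Map<String, Set<Consumer<String>>> dataWatches = new HashMap<>();

    /** Registers a watcher for the next change on path. */
    synchronized void register(String path, Consumer<String> watcher) {
        dataWatches.computeIfAbsent(path, p -> new LinkedHashSet<>()).add(watcher);
    }

    /** Removes the watchers for path, then calls each one in turn; returns how many fired. */
    int trigger(String path) {
        Set<Consumer<String>> fired;
        synchronized (this) {
            fired = dataWatches.remove(path);      // removal is what makes the watch one-shot
        }
        if (fired == null) {
            return 0;
        }
        for (Consumer<String> watcher : fired) {
            watcher.accept(path);
        }
        return fired.size();
    }
}
```

A second trigger on the same path fires nothing until register is called again, which is why a typical callback re-reads the node with a fresh watch as its first step.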
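6. Questions
The rest of this post follows up on a few related points in question form.
Question 1: How is the Leader chosen?
zk does not run Paxos itself for leader election. The election is part of ZAB, which is similar in spirit to Paxos, and is implemented by FastLeaderElection. For background on the Paxos family, see my previous post, which explains Paxos and Fast Paxos through a story.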
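As far as I can tell from the 3.4.x source, the election in FastLeaderElection comes down to servers exchanging votes and each adopting the better one, where better means a higher epoch, then a higher zxid, then a higher server id. The sketch below shows only that comparison. ElectionVote is a hypothetical class, and the real code also handles quorum checks, election rounds, and timeouts.

```java
/** Toy vote with the three fields compared during election (hypothetical class). */
final class ElectionVote {
    final long epoch;   // epoch of the proposed leader
    final long zxid;    // last transaction id seen by the proposed leader
    final long sid;     // server id (myid) of the proposed leader

    ElectionVote(long epoch, long zxid, long sid) {
        this.epoch = epoch;
        this.zxid = zxid;
        this.sid = sid;
    }

    /** True if candidate should replace current: higher epoch, then zxid, then sid. */
    static boolean supersedes(ElectionVote candidate, ElectionVote current) {
        if (candidate.epoch != current.epoch) {
            return candidate.epoch > current.epoch;
        }
        if (candidate.zxid != current.zxid) {
            return candidate.zxid > current.zxid;
        }
        return candidate.sid > current.sid;
    }
}
```

Preferring the highest zxid means the server with the most complete history wins, so the new Leader already holds every committed transaction.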
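Question 2: What happens when the Leader goes down?
This is where the crash recovery part of ZAB comes in. When the Leader crashes or is cut off by a network failure, the ensemble enters crash recovery mode: it stops serving reads and stops data synchronization, and runs a Leader election.
Once a Leader has been elected, the servers synchronize with it to recover, and then the ensemble enters message broadcast mode.
Question 3: What is the Observer role for?
The Observer role exists to let zk scale out. It is similar to the Follower role described above, but a Follower takes part in Leader election and an Observer does not.
Because an Observer does not vote and only needs to receive synchronized data, it can also be deployed across data centers.
That's about it for now; I'll add more when I have time.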