ZooKeeper 3.4 Distributed Coordinator: Programmer's Guide
[ZooKeeper is open-source software under Apache Hadoop that serves as a distributed coordinator. This article comes from the official ZooKeeper website: http://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html]
Programmer's Guide: Developing Distributed Applications that use ZooKeeper
Introduction
This document is a guide for developers who want to create distributed applications that take advantage of ZooKeeper's coordination services. It contains conceptual and practical information.
The first four sections of this guide present higher-level discussions of various ZooKeeper concepts. These are necessary both for understanding how ZooKeeper works and for working with it. This part does not contain source code, but it does assume familiarity with the problems associated with distributed computing. The sections in this first group are:
- The ZooKeeper Data Model
- ZooKeeper Sessions
- ZooKeeper Watches
- Consistency Guarantees
The next four sections provide practical programming information. These are:
- Building Blocks: A Guide to ZooKeeper Operations
- Bindings
- Program Structure, with Simple Example [tbd]
- Gotchas: Common Problems and Troubleshooting
The guide concludes with an appendix containing links to other useful, ZooKeeper-related information.
Most of the information in this document is written to be accessible as stand-alone reference material. However, before starting your first ZooKeeper application, you should probably at least read the chapters on the ZooKeeper Data Model and ZooKeeper Basic Operations. Also, the Simple Programming Example [tbd] is helpful for understanding the basic structure of a ZooKeeper client application.
The ZooKeeper Data Model
ZooKeeper has a hierarchical namespace, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references. Any Unicode character can be used in a path subject to the following constraints:
- The null character (\u0000) cannot be part of a path name. (This causes problems with the C binding.)
- The following characters can't be used because they don't display well, or render in confusing ways: \u0001 - \u0019 and \u007F - \u009F.
- The following characters are not allowed: \ud800 -uF8FFF, \uFFF0-uFFFF, \uXFFFE - \uXFFFF (where X is a digit 1 - E), \uF0000 - \uFFFFF.
- The "." character can be used as part of another name, but "." and ".." cannot alone be used to indicate a node along a path, because ZooKeeper doesn't use relative paths. The following would be invalid: "/a/b/./c" or "/a/b/../c".
- The token "zookeeper" is reserved.
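The path constraints above can be sketched as a small validator. This is plain Python for illustration, not part of any ZooKeeper binding; the function name is hypothetical. It covers only the rules that are stated unambiguously (absolute paths, the null character, the two poorly displayed ranges, and "." / ".." components), and it assumes that "canonical" also rules out empty components and a trailing slash.

```python
def is_valid_znode_path(path: str) -> bool:
    """Check a path against the unambiguous constraints listed above."""
    # Paths are absolute and slash-separated; there are no relative references.
    if not path.startswith("/"):
        return False
    if path == "/":
        return True
    for ch in path:
        code = ord(ch)
        # The null character is never allowed.
        if code == 0x0000:
            return False
        # Characters that display poorly: \u0001 - \u0019 and \u007F - \u009F.
        if 0x0001 <= code <= 0x0019 or 0x007F <= code <= 0x009F:
            return False
    for component in path.split("/")[1:]:
        # Assumption: a canonical path has no empty components or trailing slash.
        if component == "":
            return False
        # "." and ".." may appear inside a name but not as a whole component.
        if component in (".", ".."):
            return False
    return True
```

For example, `is_valid_znode_path("/a/b.txt/c")` is accepted because the dot is part of a longer name, while `"/a/b/./c"` and `"/a/b/../c"` are rejected.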
ZNodes
Every node in a ZooKeeper tree is referred to as a znode. Znodes maintain a stat structure that includes version numbers for data changes and ACL changes. The stat structure also has timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. (This behavior can be overridden. For more information see... )[tbd...]
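The version check described above is a form of optimistic concurrency control. The following is a minimal in-memory sketch of the idea only; `ZNodeSketch` and `BadVersionError` are hypothetical names invented for this illustration and are not ZooKeeper classes.

```python
class BadVersionError(Exception):
    """Raised when the supplied version does not match the znode's version."""


class ZNodeSketch:
    """Hypothetical in-memory stand-in for one znode's data and data version."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.version = 0  # number of changes to the data so far

    def get_data(self):
        # A read returns the data together with its current version.
        return self.data, self.version

    def set_data(self, data: bytes, expected_version: int) -> int:
        # The update is applied only if the caller saw the latest version.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, actual {self.version}")
        self.data = data
        self.version += 1
        return self.version


node = ZNodeSketch(b"v0")
_, seen = node.get_data()
node.set_data(b"v1", seen)          # succeeds, version becomes 1
try:
    node.set_data(b"stale", seen)   # fails: the version read earlier is stale
    stale_write_rejected = False
except BadVersionError:
    stale_write_rejected = True
```

A client whose write is rejected this way would typically re-read the data, obtain the new version, and retry.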
Znodes are the main entity that a programmer accesses. They have several characteristics that are worth mentioning here.
Watches
Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section ZooKeeper Watches.
Data Access
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data, but the data should be much less than that on average. Operating on relatively large data sizes will cause some operations to take much more time than others and will affect the latencies of some operations because of the extra time needed to move more data over the network and onto storage media. If large data storage is needed, the usual pattern of dealing with such data is to store it on a bulk storage system, such as NFS or HDFS, and store pointers to the storage locations in ZooKeeper.
Ephemeral Nodes
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children.
Sequence Nodes -- Unique Naming
When creating a znode you can also request that ZooKeeper append a monotonically increasing counter to the end of the path. This counter is unique to the parent znode. The counter has a format of %010d -- that is, 10 digits with 0 (zero) padding (the counter is formatted in this way to simplify sorting), e.g. "<path>0000000001". See Queue Recipe for an example use of this feature. Note: the counter used to store the next sequence number is a signed int (4 bytes) maintained by the parent node; the counter will overflow when incremented beyond 2147483647 (resulting in a name "<path>-2147483647").
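The %010d format can be tried out directly. The sketch below only reproduces the naming rule in plain Python; the helper name and the "/queue/item-" prefix are made up for the illustration, and the real counter is assigned by the server, not by the client.

```python
def sequential_name(path_prefix: str, counter: int) -> str:
    # %010d: ten digits, zero-padded, appended directly to the requested path.
    return "%s%010d" % (path_prefix, counter)


names = [sequential_name("/queue/item-", n) for n in (10, 2, 1)]
# Zero padding makes plain string sorting agree with numeric order.
ordered = sorted(names)
```

Because every name has the same width, a client can sort the children of a parent lexicographically and obtain creation order, which is what queue-style recipes rely on.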
Time in ZooKeeper
ZooKeeper tracks time in multiple ways:
- Zxid
Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid, and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.
- Version numbers
Every change to a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode).
- Ticks
When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout.
- Real time
ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification.
ZooKeeper Stat Structure
The Stat structure for each znode in ZooKeeper is made up of the following fields:
- czxid
The zxid of the change that caused this znode to be created.
- mzxid
The zxid of the change that last modified this znode.
- ctime
The time in milliseconds from epoch when this znode was created.
- mtime
The time in milliseconds from epoch when this znode was last modified.
- version
The number of changes to the data of this znode.
- cversion
The number of changes to the children of this znode.
- aversion
The number of changes to the ACL of this znode.
- ephemeralOwner
The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
- dataLength
The length of the data field of this znode.
- numChildren
The number of children of this znode.
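As a reading aid, the fields above can be written down as a record type. `StatSketch` below is a hypothetical Python dataclass that mirrors the field names from the list; it is not the Stat class of any binding, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatSketch:
    czxid: int           # zxid of the change that created the znode
    mzxid: int           # zxid of the change that last modified the znode
    ctime: int           # creation time, milliseconds from epoch
    mtime: int           # last modification time, milliseconds from epoch
    version: int         # number of changes to the data
    cversion: int        # number of changes to the children
    aversion: int        # number of changes to the ACL
    ephemeralOwner: int  # owner session id, or 0 if the znode is not ephemeral
    dataLength: int      # length of the data field
    numChildren: int     # number of children

    @property
    def is_ephemeral(self) -> bool:
        # A zero owner means the znode is not ephemeral.
        return self.ephemeralOwner != 0


# Invented sample values for a regular znode that was modified twice.
stat = StatSketch(czxid=2, mzxid=5, ctime=1356998400000, mtime=1356998460000,
                  version=2, cversion=0, aversion=0, ephemeralOwner=0,
                  dataLength=11, numChildren=0)
```

In this sample mzxid is larger than czxid, which by the zxid ordering rule means the last modification happened after the creation.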
ZooKeeper Sessions
A ZooKeeper client establishes a session with the ZooKeeper service by creating a handle to the service using a language binding. Once created, the handle starts off in the CONNECTING state and the client library tries to connect to one of the servers that make up the ZooKeeper service, at which point it switches to the CONNECTED state. During normal operation the handle will be in one of these two states. If an unrecoverable error occurs, such as session expiration or authentication failure, or if the application explicitly closes the handle, the handle will move to the CLOSED state. The following figure shows the possible state transitions of a ZooKeeper client:
To create a client session the application code must provide a connection string containing a comma-separated list of host:port pairs, each corresponding to a ZooKeeper server (e.g. "127.0.0.1:4545" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"). The ZooKeeper client library will pick an arbitrary server and try to connect to it. If this connection fails, or if the client becomes disconnected from the server for any reason, the client will automatically try the next server in the list, until a connection is (re-)established.
Added in 3.2.0: An optional "chroot" suffix may also be appended to the connection string. This will run the client commands while interpreting all paths relative to this root (similar to the unix chroot command). If used, the example would look like "127.0.0.1:4545/app/a" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a", where the client would be rooted at "/app/a" and all paths would be relative to this root, i.e. getting/setting/etc. "/foo/bar" would result in operations being run on "/app/a/foo/bar" (from the server perspective). This feature is particularly useful in multi-tenant environments where each user of a particular ZooKeeper service could be rooted differently. This makes re-use much simpler, as each user can code his/her application as if it were rooted at "/", while the actual location (say /app/a) could be determined at deployment time.
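To make the connection string format and the chroot mapping concrete, here is a small sketch in plain Python. Both function names are hypothetical; the real client libraries do this parsing internally, and this sketch only handles the simple host:port form shown in the examples above.

```python
def parse_connect_string(connect_string: str):
    """Split "host:port,host:port/chroot" into ([(host, port), ...], chroot)."""
    slash = connect_string.find("/")
    if slash == -1:
        servers_part, chroot = connect_string, ""
    else:
        servers_part, chroot = connect_string[:slash], connect_string[slash:]
    servers = []
    for entry in servers_part.split(","):
        host, port = entry.rsplit(":", 1)
        servers.append((host, int(port)))
    return servers, chroot


def server_side_path(chroot: str, client_path: str) -> str:
    """Map a path as the client sees it to the path the server operates on."""
    if client_path == "/":
        return chroot or "/"
    return chroot + client_path
```

With the connection string "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a", the chroot is "/app/a" and a client request for "/foo/bar" maps to "/app/a/foo/bar" on the server.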
When a client gets a handle to the ZooKeeper service, ZooKeeper creates a ZooKeeper session, represented as a 64-bit number, that it assigns to the client. If the client connects to a different ZooKeeper server, it will send the session id as a part of the connection handshake. As a security measure, the server creates a password for the session id that any ZooKeeper server can validate. The password is sent to the client with the session id when the client establishes the session. The client sends this password with the session id whenever it reestablishes the session with a new server.
One of the parameters to the ZooKeeper client library call to create a ZooKeeper session is the session timeout in milliseconds. The client sends a requested timeout, and the server responds with the timeout that it can give the client. The current implementation requires that the timeout be a minimum of 2 times the tickTime (as set in the server configuration) and a maximum of 20 times the tickTime. The ZooKeeper client API allows access to the negotiated timeout.
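The negotiation rule amounts to clamping the requested value into the range from 2 to 20 times the tickTime. A minimal sketch, with a hypothetical function name and an example tickTime of 2000 ms chosen only for illustration:

```python
def negotiate_session_timeout(requested_ms: int, tick_time_ms: int) -> int:
    """Clamp the requested timeout into [2 * tickTime, 20 * tickTime]."""
    minimum = 2 * tick_time_ms
    maximum = 20 * tick_time_ms
    return max(minimum, min(requested_ms, maximum))


tick_time_ms = 2000  # example value, not taken from any particular deployment
too_small = negotiate_session_timeout(1000, tick_time_ms)    # raised to 4000
in_range = negotiate_session_timeout(10000, tick_time_ms)    # kept at 10000
too_large = negotiate_session_timeout(60000, tick_time_ms)   # lowered to 40000
```

Because the server may change the value, application code that depends on the timeout should read the negotiated value back from the client API instead of assuming the requested one.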
When a client (session) becomes partitioned from the ZK serving cluster, it will begin searching the list of servers that were specified during session creation. Eventually, when connectivity between the client and at least one of the servers is re-established, the session will either again transition to the "connected" state (if reconnected within the session timeout value) or it will transition to the "expired" state (if reconnected after the session timeout). It is not advisable to create a new session object (a new ZooKeeper.class or zookeeper handle in the C binding) for disconnection. The ZK client library will handle reconnect for you. In particular, we have heuristics built into the client library to handle things like "herd effect", etc. Only create a new session when you are notified of session expiration (mandatory).
Session expiration is managed by the ZooKeeper cluster itself, not by the client. When the ZK client establishes a session with the cluster, it provides the "timeout" value detailed above. This value is used by the cluster to determine when the client's session expires. Expiration happens when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes). At this point the client of the expired session is still disconnected from the cluster; it will not be notified of the session expiration until/unless it is able to re-establish a connection to the cluster. The client will stay in the disconnected state until the TCP connection is re-established with the cluster, at which point the watcher of the expired session will receive the "session expired" notification.
Example state transitions for an expired session as seen by the expired session's watcher:
- 'connected' : session is established and client is communicating with cluster (client/server communication is operating properly)
- .... client is partitioned from the cluster
- 'disconnected' : client has lost connectivity with the cluster
- .... time elapses, after 'timeout' period the cluster expires the session, nothing is seen by client as it is disconnected from cluster
- .... time elapses, the client regains network level connectivity with the cluster
- 'expired' : eventually the client reconnects to the cluster, it is then notified of the expiration
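The practical consequence of this timeline is that an application reacts differently to 'disconnected' and 'expired'. The sketch below expresses that decision as a plain Python function; the function name and the returned action strings are invented for illustration, and the event names are simply the quoted states from the list above, not constants from a binding.

```python
def next_action(event: str) -> str:
    """Return what the application should do for a session state event."""
    if event == "connected":
        return "resume normal operation"
    if event == "disconnected":
        # The client library reconnects on its own; keep the same handle.
        return "wait for the library to reconnect"
    if event == "expired":
        # The cluster has already deleted the session's ephemeral nodes.
        return "create a new session"
    raise ValueError(f"unknown session event: {event}")


timeline = ["connected", "disconnected", "expired"]
actions = [next_action(event) for event in timeline]
```

Only the last event in the timeline calls for a new session object; during the 'disconnected' phase the application keeps its handle and waits.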
Another parameter to the ZooKeeper session establishment call is the default watcher. Watchers are notified when any state change occurs in the client. For example, if the client loses connectivity to the server the client will be notified, or if the client's session expires, etc. This watcher should consider the initial state to be disconnected (i.e. before any state change events are sent to the watcher by the client lib). In the case of a new connection, the first event sent to the watcher is typically the session connection event.
The session is kept alive by requests sent by the client. If the session is idle for a period of time that would time out the session, the client will send a PING request to keep the session alive. This PING request not only allows the ZooKeeper server to know that the client is still active, but it also allows the client to verify that its connection to the ZooKeeper server is still active. The timing of the PING is conservative enough to ensure reasonable time to detect a dead connection and reconnect to a new server.
Once a connection to the server is successfully established (connected), there are basically two cases where the client lib generates connection loss (the result code in the C binding, an exception in Java -- see the API documentation for binding-specific details) when either a synchronous or asynchronous operation is performed and one of the following holds:
- The application calls an operation on a session that is no longer alive/valid.
- The ZooKeeper client disconnects from a server when there are pending operations to that server, i.e., there is a pending asynchronous call.
Added in 3.2.0 -- SessionMovedException. There is an internal exception that is generally not seen by clients called the SessionMovedException. This exception occurs because a request was received on a connection for a session which has been reestablished on a different server. The normal cause of this error is a client that sends a request to a server, but the network packet gets delayed, so the client times out and connects to a new server. When the delayed packet arrives at the first server, the old server detects that the session has moved, and closes the client connection. Clients normally do not see this error since they do not read from those old connections. (Old connections are usually closed.) One situation in which this condition can be seen is when two clients try to reestablish the same connection using a saved session id and password. One of the clients will reestablish the connection and the second client will be disconnected (causing the pair to attempt to re-establish its connection/session indefinitely).
ZooKeeper Watches
All of the read operations in ZooKeeper - getData(), getChildren(), and exists() - have the option of setting a watch as a side effect. Here is ZooKeeper's definition of a watch: a watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:
- One-time trigger
One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.
- Sent to the client
This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.
- The data for which the watch was set
This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.
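The "two lists of watches" model and the one-time trigger rule can be simulated in a few lines. The class below is a hypothetical in-memory model written in plain Python, not ZooKeeper code; the event labels ("changed", "created", "deleted", "children changed") are invented for the sketch.

```python
from collections import defaultdict


def parent_of(path):
    head = path.rsplit("/", 1)[0]
    return head or "/"


class WatchTableSketch:
    """Hypothetical model of the two watch lists: data and child watches."""

    def __init__(self):
        self.data_watches = defaultdict(list)
        self.child_watches = defaultdict(list)

    def watch_data(self, path, callback):      # as set by getData() / exists()
        self.data_watches[path].append(callback)

    def watch_children(self, path, callback):  # as set by getChildren()
        self.child_watches[path].append(callback)

    def _fire(self, table, path, event):
        # pop() removes the watches before firing: each one triggers once.
        for callback in table.pop(path, []):
            callback(event, path)

    def on_set_data(self, path):
        self._fire(self.data_watches, path, "changed")

    def on_create(self, path):
        self._fire(self.data_watches, path, "created")
        self._fire(self.child_watches, parent_of(path), "children changed")

    def on_delete(self, path):
        self._fire(self.data_watches, path, "deleted")
        self._fire(self.child_watches, path, "deleted")
        self._fire(self.child_watches, parent_of(path), "children changed")


events = []
table = WatchTableSketch()
table.watch_data("/znode1", lambda event, path: events.append((event, path)))
table.on_set_data("/znode1")   # fires the watch and clears it
table.on_set_data("/znode1")   # no watch is set any more, nothing fires
```

After the two updates, `events` holds a single entry: the second change goes unnoticed because the client did not set a new watch in between.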
Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.
What ZooKeeper Guarantees about Watches
With regard to watches, ZooKeeper maintains these guarantees:
- Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.
- A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
- The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.
Things to Remember about Watches
- Watches are one-time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.
- Because watches are one-time triggers and there is latency between getting the event and sending a new request to get a watch, you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)
- A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.
- When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.
ZooKeeper access control using ACLs
ZooKeeper uses ACLs to control access to its znodes (the data nodes of a ZooKeeper data tree). The ACL implementation is quite similar to UNIX file access permissions: it employs permission bits to allow/disallow various operations against a node and the scope to which the bits apply. Unlike standard UNIX permissions, a ZooKeeper node is not limited by the three standard scopes for user (owner of the file), group, and world (other). ZooKeeper does not have a notion of an owner of a znode. Instead, an ACL specifies sets of ids and permissions that are associated with those ids.
Note also that an ACL pertains only to a specific znode. In particular it does not apply to children. For example, if /app is only readable by ip:172.16.16.1 and /app/status is world readable, anyone will be able to read /app/status; ACLs are not recursive.
ZooKeeper supports pluggable authentication schemes. Ids are specified using the form scheme:id, where scheme is the authentication scheme that the id corresponds to. For example, ip:172.16.16.1 is an id for a host with the address 172.16.16.1.
When a client connects to ZooKeeper and authenticates itself, ZooKeeper associates all the ids that correspond to a client with the client's connection. These ids are checked against the ACLs of znodes when a client tries to access a node. ACLs are made up of pairs of (scheme:expression, perms). The format of the expression is specific to the scheme. For example, the pair (ip:19.22.0.0/16, READ) gives the READ permission to any clients with an IP address that starts with 19.22.
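The (ip:19.22.0.0/16, READ) example can be reproduced with Python's standard ipaddress module. This is only a sketch of the matching idea: the function names are hypothetical, permissions are represented as plain strings, and only the ip scheme is handled.

```python
import ipaddress


def ip_expression_matches(expression: str, client_ip: str) -> bool:
    """Compare the most significant bits of the client IP with addr/bits."""
    network = ipaddress.ip_network(expression, strict=False)
    return ipaddress.ip_address(client_ip) in network


def allowed_perms(acl, client_ip: str) -> set:
    """Collect permissions from every (ip:expression, perms) pair that matches."""
    granted = set()
    for scheme_expr, perms in acl:
        scheme, expression = scheme_expr.split(":", 1)
        if scheme == "ip" and ip_expression_matches(expression, client_ip):
            granted.update(perms)
    return granted


acl = [("ip:19.22.0.0/16", {"READ"})]
inside = allowed_perms(acl, "19.22.7.1")    # address starts with 19.22
outside = allowed_perms(acl, "19.23.0.1")   # address does not match
```

Note that the check is made per znode: a matching ACL on /app says nothing about /app/status, which carries its own ACL.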
ACL Permissions
ZooKeeper supports the following permissions:
- CREATE: you can create a child node
- READ: you can get data from a node and list its children
- WRITE: you can set data for a node
- DELETE: you can delete a child node
- ADMIN: you can set permissions
The CREATE and DELETE permissions have been broken out of the WRITE permission for finer-grained access controls. The cases for CREATE and DELETE are the following:
- You want A to be able to do a set on a ZooKeeper node, but not be able to CREATE or DELETE children.
- CREATE without DELETE: clients create requests by creating ZooKeeper nodes in a parent directory. You want all clients to be able to add, but only the request processor can delete. (This is kind of like the APPEND permission for files.)
Also, the ADMIN permission is there since ZooKeeper doesn't have a notion of file owner. In some sense the ADMIN permission designates the entity as the owner. ZooKeeper doesn't support the LOOKUP permission (execute permission bit on directories to allow you to LOOKUP even though you can't list the directory). Everyone implicitly has LOOKUP permission. This allows you to stat a node, but nothing more. (The problem is, if you want to call zoo_exists() on a node that doesn't exist, there is no permission to check.)
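Since the permissions are described as bits, the "CREATE without DELETE" case can be modeled with a flag type. The sketch below uses Python's IntFlag; the `Perms` class and the `can` helper are hypothetical, and the bit values are an assumption made for this illustration (the text above does not give numeric values, so check your binding's constants before relying on them).

```python
from enum import IntFlag


class Perms(IntFlag):
    # Bit positions are assumed for this sketch, not taken from the text above.
    READ = 1 << 0
    WRITE = 1 << 1
    CREATE = 1 << 2
    DELETE = 1 << 3
    ADMIN = 1 << 4
    ALL = READ | WRITE | CREATE | DELETE | ADMIN


def can(granted: Perms, needed: Perms) -> bool:
    """True if every bit in `needed` is present in `granted`."""
    return (granted & needed) == needed


# Request-queue pattern: every client may add requests, only the
# request processor may remove them.
producers = Perms.CREATE
processor = Perms.READ | Perms.DELETE
```

With these grants a producer can add a child under the parent node but cannot delete one, which is the APPEND-like behavior described above.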
Builtin ACL Schemes
ZooKeeper has the following built in schemes:
- world has a single id, anyone, that represents anyone.
- auth doesn't use any id, represents any authenticated user.
- digest uses a username:password string to generate a SHA1 hash which is then used as an ACL ID identity. Authentication is done by sending the username:password in clear text. When used in the ACL the expression will be the username:base64 encoded SHA1 password digest.
- ip uses the client host IP as an ACL ID identity. The ACL expression is of the form addr/bits where the most significant bits of addr are matched against the most significant bits of the client host IP.
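To make the digest expression concrete, here is a small sketch of how such an ACL id could be derived from a username:password string using only the Java standard library. The class and method names are hypothetical, and the sketch assumes the digest is taken over the whole username:password string, as the description above suggests.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class DigestIdSketch {
    /**
     * Hypothetical helper: turns "user:password" into
     * "user:" + base64(SHA-1("user:password")).
     */
    static String digestId(String userPassword) {
        int colon = userPassword.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected username:password");
        }
        try {
            byte[] sha1 = MessageDigest.getInstance("SHA-1")
                    .digest(userPassword.getBytes(StandardCharsets.UTF_8));
            return userPassword.substring(0, colon) + ":"
                    + Base64.getEncoder().encodeToString(sha1);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The printed value is what would appear as the id in a digest ACL entry.
        System.out.println(digestId("alice:secret"));
    }
}
```

Because the password travels in clear text during authentication, the stored digest protects only what is kept in the ACL, not the connection itself.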
ZooKeeper C client API
The following constants are provided by the ZooKeeper C library:
- const int ZOO_PERM_READ; //can read node’s value and list its children
- const int ZOO_PERM_WRITE;// can set the node’s value
- const int ZOO_PERM_CREATE; //can create children
- const int ZOO_PERM_DELETE;// can delete children
- const int ZOO_PERM_ADMIN; //can execute set_acl()
- const int ZOO_PERM_ALL;// all of the above flags OR’d together
The following are the standard ACL IDs:
- struct Id ZOO_ANYONE_ID_UNSAFE; // ('world', 'anyone')
- struct Id ZOO_AUTH_IDS; // ('auth', '')
The empty identity string of ZOO_AUTH_IDS should be interpreted as "the identity of the creator".
The ZooKeeper client comes with three standard ACLs:
- struct ACL_vector ZOO_OPEN_ACL_UNSAFE; // (ZOO_PERM_ALL, ZOO_ANYONE_ID_UNSAFE)
- struct ACL_vector ZOO_READ_ACL_UNSAFE; // (ZOO_PERM_READ, ZOO_ANYONE_ID_UNSAFE)
- struct ACL_vector ZOO_CREATOR_ALL_ACL; // (ZOO_PERM_ALL, ZOO_AUTH_IDS)
ZOO_OPEN_ACL_UNSAFE is a completely open, free-for-all ACL: any application can execute any operation on the node and can create, list and delete its children. ZOO_READ_ACL_UNSAFE gives read-only access to any application. ZOO_CREATOR_ALL_ACL grants all permissions to the creator of the node. The creator must have been authenticated by the server (for example, using the "digest" scheme) before it can create nodes with this ACL.
The following ZooKeeper operations deal with ACLs:
- int zoo_add_auth(zhandle_t *zh, const char *scheme, const char *cert, int certLen, void_completion_t completion, const void *data);
The application uses the zoo_add_auth function to authenticate itself to the server. The function can be called multiple times if the application wants to authenticate using different schemes and/or identities.
- int zoo_create(zhandle_t *zh, const char *path, const char *value, int valuelen, const struct ACL_vector *acl, int flags, char *realpath, int max_realpath_len);
The zoo_create(...) operation creates a new node. The acl parameter is a list of ACLs associated with the node. The parent node must have the CREATE permission bit set.
- int zoo_get_acl(zhandle_t *zh, const char *path, struct ACL_vector *acl, struct Stat *stat);
This operation returns a node's ACL info.
- int zoo_set_acl(zhandle_t *zh, const char *path, int version, const struct ACL_vector *acl);
This function replaces a node's ACL list with a new one. The node must have the ADMIN permission set.
Here is sample code that makes use of the above APIs to authenticate itself using the "foo" scheme and create an ephemeral node "/xyz" with create-only permissions.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include "zookeeper.h"

static zhandle_t *zh;

/**
 * In this example this method gets the cert for your
 * environment -- you must provide it
 */
char *foo_get_cert_once(char *id) { return 0; }

/** Watcher function -- empty for this example, not something you should
 * do in real code */
void watcher(zhandle_t *zzh, int type, int state, const char *path,
             void *watcherCtx) {}

int main(int argc, char **argv) {
    char buffer[512];
    char p[2048];
    char *cert = 0;
    char appId[64];

    strcpy(appId, "example.foo_test");
    cert = foo_get_cert_once(appId);
    if (cert != 0) {
        fprintf(stderr,
                "Certificate for appid [%s] is [%s]\n", appId, cert);
        strncpy(p, cert, sizeof(p) - 1);
        p[sizeof(p) - 1] = '\0'; /* strncpy does not terminate a truncated copy */
        free(cert);
    } else {
        fprintf(stderr, "Certificate for appid [%s] not found\n", appId);
        strcpy(p, "dummy");
    }

    zoo_set_debug_level(ZOO_LOG_LEVEL_DEBUG);

    zh = zookeeper_init("localhost:3181", watcher, 10000, 0, 0, 0);
    if (!zh) {
        return errno;
    }
    if (zoo_add_auth(zh, "foo", p, strlen(p), 0, 0) != ZOK) {
        zookeeper_close(zh);
        return 2;
    }

    /* Grant only CREATE to the authenticated creator of the node. */
    struct ACL CREATE_ONLY_ACL[] = {{ZOO_PERM_CREATE, ZOO_AUTH_IDS}};
    struct ACL_vector CREATE_ONLY = {1, CREATE_ONLY_ACL};
    int rc = zoo_create(zh, "/xyz", "value", 5, &CREATE_ONLY, ZOO_EPHEMERAL,
                        buffer, sizeof(buffer) - 1);
    if (rc) {
        fprintf(stderr, "Error %d at line %d\n", rc, __LINE__);
    }

    /** this operation will fail with a ZNOAUTH error */
    int buflen = sizeof(buffer);
    struct Stat stat;
    rc = zoo_get(zh, "/xyz", 0, buffer, &buflen, &stat);
    if (rc) {
        /* __LINE__ is an integer, so it is printed with %d */
        fprintf(stderr, "Error %d at line %d\n", rc, __LINE__);
    }

    zookeeper_close(zh);
    return 0;
}
Pluggable ZooKeeper authentication
ZooKeeper runs in a variety of different environments with various different authentication schemes, so it has a completely pluggable authentication framework. Even the builtin authentication schemes use the pluggable authentication framework.
To understand how the authentication framework works, first you must understand the two main authentication operations. The framework first must authenticate the client. This is usually done as soon as the client connects to a server and consists of validating information sent from or gathered about a client and associating it with the connection. The second operation handled by the framework is finding the entries in an ACL that correspond to the client. ACL entries are <idspec, permissions> pairs. The idspec may be a simple string match against the authentication information associated with the connection or it may be an expression that is evaluated against that information. It is up to the implementation of the authentication plugin to do the match. Here is the interface that an authentication plugin must implement:
public interface AuthenticationProvider {
String getScheme();
KeeperException.Code handleAuthentication(ServerCnxn cnxn, byte authData[]);
boolean isValid(String id);
boolean matches(String id, String aclExpr);
boolean isAuthenticated();
}
The first method, getScheme, returns the string that identifies the plugin. Because we support multiple methods of authentication, an authentication credential or an idspec will always be prefixed with scheme:. The ZooKeeper server uses the scheme returned by the authentication plugin to determine which ids the scheme applies to.
handleAuthentication is called when a client sends authentication information to be associated with a connection. The client specifies the scheme to which the information corresponds. The ZooKeeper server passes the information to the authentication plugin whose getScheme matches the scheme passed by the client. The implementor of handleAuthentication will usually return an error if it determines that the information is bad, or it will associate information with the connection using cnxn.getAuthInfo().add(new Id(getScheme(), data)).
The authentication plugin is involved in both setting and using ACLs. When an ACL is set for a znode, the ZooKeeper server will pass the id part of the entry to the isValid(String id) method. It is up to the plugin to verify that the id has a correct form. For example, ip:172.16.0.0/16 is a valid id, but ip:host.com is not. If the new ACL includes an "auth" entry, isAuthenticated is used to see if the authentication information for this scheme that is associated with the connection should be added to the ACL. Some schemes should not be included in auth. For example, the IP address of the client is not considered as an id that should be added to the ACL if auth is specified.
ZooKeeper invokes matches(String id, String aclExpr) when checking an ACL. It needs to match authentication information of the client against the relevant ACL entries. To find the entries which apply to the client, the ZooKeeper server will find the scheme of each entry and, if there is authentication information from that client for that scheme, matches(String id, String aclExpr) will be called with id set to the authentication information that was previously added to the connection by handleAuthentication and aclExpr set to the id of the ACL entry. The authentication plugin uses its own logic and matching scheme to determine if id is included in aclExpr.
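As an illustration of the kind of logic a plugin's matches method might contain, the sketch below compares a client IPv4 address against an addr/bits expression, in the spirit of the ip scheme. It is a standalone approximation rather than the built-in provider: the class name and the parseIpv4 helper are hypothetical, and only dotted IPv4 addresses are handled.

```java
public class IpMatchSketch {
    /** Hypothetical helper: parses a dotted IPv4 address into a 32-bit value. */
    static int parseIpv4(String addr) {
        String[] parts = addr.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + addr);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in: " + addr);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /**
     * Mirrors the shape of matches(String id, String aclExpr): 'id' is the
     * client address, 'aclExpr' is "addr" or "addr/bits".
     */
    static boolean matches(String id, String aclExpr) {
        String[] parts = aclExpr.split("/", 2);
        int bits = parts.length == 2 ? Integer.parseInt(parts[1]) : 32;
        if (bits < 0 || bits > 32) {
            return false;
        }
        // A shift by 32 is a no-op in Java, so zero bits needs its own case.
        int mask = bits == 0 ? 0 : -1 << (32 - bits);
        return (parseIpv4(id) & mask) == (parseIpv4(parts[0]) & mask);
    }

    public static void main(String[] args) {
        System.out.println(matches("19.22.3.4", "19.22.0.0/16"));
        System.out.println(matches("19.23.3.4", "19.22.0.0/16"));
    }
}
```

Only the most significant bits named in the expression are compared, which is why every client whose address starts with 19.22 satisfies the 19.22.0.0/16 entry.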
There are two built in authentication plugins: ip and digest. Additional plugins can be added using system properties. At startup the ZooKeeper server will look for system properties that start with "zookeeper.authProvider." and interpret the value of those properties as the class name of an authentication plugin. These properties can be set using -Dzookeeper.authProvider.X=com.f.MyAuth or by adding entries such as the following in the server configuration file:
authProvider.1=com.f.MyAuth
authProvider.2=com.f.MyAuth2
Care should be taken to ensure that the suffix on the property is unique. If there are duplicates such as -Dzookeeper.authProvider.X=com.f.MyAuth -Dzookeeper.authProvider.X=com.f.MyAuth2, only one will be used. Also, all servers must have the same plugins defined; otherwise clients using the authentication schemes provided by the plugins will have problems connecting to some servers.
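The reason duplicate suffixes collapse can be seen with a short sketch of a prefix scan over a property set. This is not the server's actual startup code; the class and method names are hypothetical, and the sketch only assumes that properties behave like a map keyed by the full property name.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class AuthProviderScanSketch {
    static final String PREFIX = "zookeeper.authProvider.";

    /** Hypothetical helper: collects property name -> plugin class name. */
    static Map<String, String> findProviders(Properties props) {
        Map<String, String> found = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            if (name.startsWith(PREFIX)) {
                found.put(name, props.getProperty(name));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PREFIX + "1", "com.f.MyAuth");
        props.setProperty(PREFIX + "2", "com.f.MyAuth2");
        // Reusing suffix "1" replaces the earlier entry instead of adding a third.
        props.setProperty(PREFIX + "1", "com.f.MyAuth3");
        System.out.println(findProviders(props));
    }
}
```

Since a property name can hold only one value, a reused suffix silently hides one of the plugins, which is the failure mode the warning above describes.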
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
- If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has been applied or not. We take steps to minimize the failures, but the guarantee is present only with successful return codes. (This is called the monotonicity condition in Paxos.)
- Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.
Using these consistency guarantees it is easy to build higher level functions such as leader election, barriers, queues, and read/write revocable locks solely at the ZooKeeper client (no additions needed to ZooKeeper). See Recipes and Solutions for more details.
So, ZooKeeper by itself doesn't guarantee that changes occur synchronously across all servers, but ZooKeeper primitives can be used to construct higher level functions that provide useful client synchronization. (For more information, see the ZooKeeper Recipes. [tbd:..]).
Bindings
The ZooKeeper client libraries come in two languages: Java and C. The following sections describe these.
Java Binding
There are two packages that make up the ZooKeeper Java binding: org.apache.zookeeper and org.apache.zookeeper.data. The rest of the packages that make up ZooKeeper are used internally or are part of the server implementation. The org.apache.zookeeper.data package is made up of generated classes that are used simply as containers.
The main class used by a ZooKeeper Java client is the ZooKeeper class. Its two constructors differ only by an optional session id and password. ZooKeeper supports session recovery across instances of a process. A Java program may save its session id and password to stable storage, restart, and recover the session that was used by the earlier instance of the program.
When a ZooKeeper object is created, two threads are created as well: an IO thread and an event thread. All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread. Session maintenance such as reconnecting to ZooKeeper servers and maintaining the heartbeat is done on the IO thread. Responses for synchronous methods are also processed in the IO thread. All responses to asynchronous methods and watch events are processed on the event thread. There are a few things to notice that result from this design:
- All completions for asynchronous calls and watcher callbacks will be made in order, one at a time. The caller can do any processing they wish, but no other callbacks will be processed during that time.
- Callbacks do not block the processing of the IO thread or the processing of the synchronous calls.
- Synchronous calls may not return in the correct order. For example, assume a client does the following processing: it issues an asynchronous read of node /a with watch set to true, and then in the completion callback of the read it does a synchronous read of /a. (Maybe not good practice, but not illegal either, and it makes for a simple example.)
Note that if there is a change to /a between the asynchronous read and the synchronous read, the client library will receive the watch event saying /a changed before the response for the synchronous read, but because the completion callback is blocking the event queue, the synchronous read will return with the new value of /a before the watch event is processed.
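The ordering effect described above can be imitated without a ZooKeeper server. The sketch below is a simulation under stated assumptions: a single-thread executor stands in for the event thread, an AtomicReference stands in for node /a, and all names are hypothetical. It shows only why a queued watch event waits behind a running completion callback.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class EventOrderSketch {
    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        AtomicReference<String> nodeA = new AtomicReference<>("v1"); // stand-in for /a
        ExecutorService eventThread = Executors.newSingleThreadExecutor();

        String asyncResult = nodeA.get(); // the asynchronous read observes v1
        nodeA.set("v2");                  // /a changes before any callback runs

        // Completion callback of the asynchronous read, queued first.
        eventThread.submit(() -> {
            log.add("completion: async read returned " + asyncResult);
            // The synchronous read inside the callback bypasses the event queue.
            log.add("sync read returned " + nodeA.get());
        });
        // The watch event is queued behind the completion callback.
        eventThread.submit(() -> log.add("watch event: /a changed"));

        eventThread.shutdown();
        try {
            eventThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this simulation the synchronous read reports the new value before the watch event is handled, matching the behavior described for the real client.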
Finally, the rules associated with shutdown are straightforward: once a ZooKeeper object is closed or receives a fatal event (SESSION_EXPIRED and AUTH_FAILED), the ZooKeeper object becomes invalid. On a close, the two threads shut down and any further access on the zookeeper handle is undefined behavior and should be avoided.
C Binding
The C binding has a single-threaded and a multi-threaded library. The multi-threaded library is easiest to use and is most similar to the Java API. This library will create an IO thread and an event dispatch thread for handling connection maintenance and callbacks. The single-threaded library allows ZooKeeper to be used in event driven applications by exposing the event loop used in the multi-threaded library.
The package includes two shared libraries: zookeeper_st and zookeeper_mt. The former only provides the asynchronous APIs and callbacks for integrating into the application's event loop. The only reason this library exists is to support platforms where a pthread library is not available or is unstable (e.g. FreeBSD 4.x). In all other cases, application developers should link with zookeeper_mt, as it includes support for both the Sync and Async API.
Installation
If you're building the client from a check-out from the Apache repository, follow the steps outlined below. If you're building from a project source package downloaded from Apache, skip to step 3.
1. Run ant compile_jute from the ZooKeeper top level directory (.../trunk). This will create a directory named "generated" under .../trunk/src/c.
2. Change directory to .../trunk/src/c and run autoreconf -if to bootstrap autoconf, automake and libtool. Make sure you have autoconf version 2.59 or greater installed. Skip to step 4.
3. If you are building from a project source package, unzip/untar the source tarball and cd to the zookeeper-x.x.x/src/c directory.
4. Run ./configure <your-options> to generate the makefile. Here are some of the options the configure utility supports that can be useful in this step:
   - --enable-debug
     Enables optimization and enables debug info compiler options. (Disabled by default.)
   - --without-syncapi
     Disables Sync API support; the zookeeper_mt library won't be built. (Enabled by default.)
   - --disable-static
     Do not build static libraries. (Enabled by default.)
   - --disable-shared
     Do not build shared libraries. (Enabled by default.)
   Note: See INSTALL for general information about running configure.
5. Run make or make install to build the libraries and install them.
6. To generate doxygen documentation for the ZooKeeper API, run make doxygen-doc. All documentation will be placed in a new subfolder named docs. By default, this command only generates HTML. For information on other document formats, run ./configure --help
Using the C Client
You can test your client by running a ZooKeeper server (see instructions on the project wiki page on how to run it) and connecting to it using one of the cli applications that were built as part of the installation procedure. cli_mt (multithreaded, built against the zookeeper_mt library) is shown in this example, but you could also use cli_st (single threaded, built against the zookeeper_st library):
$ cli_mt zookeeper_host:9876
This is a client application that gives you a shell for executing simple ZooKeeper commands. Once successfully started and connected to the server it displays a shell prompt. You can now enter ZooKeeper commands. For example, to create a node:
> create /my_new_node
To verify that the node's been created:
> ls /
You should see a list of nodes that are children of the root node "/".
In order to be able to use the ZooKeeper API in your application you have to remember to
- Include the ZooKeeper header: #include <zookeeper/zookeeper.h>
- If you are building a multithreaded client, compile with the -DTHREADED compiler flag to enable the multi-threaded version of the library, and then link against the zookeeper_mt library. If you are building a single-threaded client, do not compile with -DTHREADED, and be sure to link against the zookeeper_st library.
Refer to Program Structure, with Simple Example for examples of usage in Java and C. [tbd]
Building Blocks: A Guide to ZooKeeper Operations
This section surveys all the operations a developer can perform against a ZooKeeper server. It is lower level information than the earlier concepts chapters in this manual, but higher level than the ZooKeeper API Reference. It covers these topics:
Handling Errors
Both the Java and C client bindings may report errors. The Java client binding does so by throwing KeeperException; calling code() on the exception will return the specific error code. The C client binding returns an error code as defined in the enum ZOO_ERRORS. API callbacks indicate the result code for both language bindings. See the API documentation (javadoc for Java, doxygen for C) for full details on the possible errors and their meaning.
Connecting to ZooKeeper
Read Operations
Write Operations
Handling Watches
Miscellaneous ZooKeeper Operations
Program Structure, with Simple Example
[tbd]
Gotchas: Common Problems and Troubleshooting
So now you know ZooKeeper. It's fast, simple, your application works, but wait ...something's wrong. Here are some pitfalls that ZooKeeper users fall into:
- If you are using watches, you must look for the connected watch event. When a ZooKeeper client disconnects from a server, you will not receive notification of changes until reconnected. If you are watching for a znode to come into existence, you will miss the event if the znode is created and deleted while you are disconnected.
- You must test ZooKeeper server failures. The ZooKeeper service can survive failures as long as a majority of servers are active. The question to ask is: can your application handle it? In the real world a client's connection to ZooKeeper can break. (ZooKeeper server failures and network partitions are common reasons for connection loss.) The ZooKeeper client library takes care of recovering your connection and letting you know what happened, but you must make sure that you recover your state and any outstanding requests that failed. Find out if you got it right in the test lab, not in production: test with a ZooKeeper service made up of several servers and subject them to reboots.
- The list of ZooKeeper servers used by the client must match the list of ZooKeeper servers that each ZooKeeper server has. Things can work, although not optimally, if the client list is a subset of the real list of ZooKeeper servers, but not if the client lists ZooKeeper servers not in the ZooKeeper cluster.
- Be careful where you put that transaction log. The most performance-critical part of ZooKeeper is the transaction log. ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it can mitigate it.
- Set your Java max heap size correctly. It is very important to avoid swapping. Going to disk unnecessarily will almost certainly degrade your performance unacceptably. Remember, in ZooKeeper, everything is ordered, so if one request hits the disk, all other queued requests hit the disk.
To avoid swapping, try to set the heap size to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configurations is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.
Outside the formal documentation, there are several other sources of information for ZooKeeper developers.
ZooKeeper Whitepaper [tbd: find url]
The definitive discussion of ZooKeeper design and performance, by Yahoo! Research
API Reference [tbd: find url]
The complete reference to the ZooKeeper API
ZooKeeper Talk at the Hadoop Summit 2008
A video introduction to ZooKeeper, by Benjamin Reed of Yahoo! Research
The excellent Java tutorial by Flavio Junqueira, implementing simple barriers and producer-consumer queues using ZooKeeper.
ZooKeeper - A Reliable, Scalable Distributed Coordination System
An article by Todd Hoff (07/15/2008)
Pseudo-level discussion of the implementation of various synchronization solutions with ZooKeeper: Event Handles, Queues, Locks, and Two-phase Commits.
[tbd]
Any other good sources anyone can think of...