Swift Architectural Overview
.. TODO - add links to more detailed overview in each section below.
Proxy Server
The Proxy Server is responsible for tying together the rest of the Swift
architecture. For each request, it will look up the location of the account,
container, or object in the ring (see below) and route the request accordingly.
The public API is also exposed through the Proxy Server.
The proxy server ties together the other components of Swift. For each request, it looks up the account, container, or object in the ring and routes the request accordingly. The public API is also exposed to the outside through the proxy server.

A large number of failures are also handled in the Proxy Server. For
example, if a server is unavailable for an object PUT, it will ask the
ring for a handoff server and route there instead.
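The proxy server also handles a large number of request failures (translator's note: not every request succeeds). For example, if the target server is unreachable during an object PUT (translator's note: PUT uploads an object), the proxy server asks the ring for a handoff server and routes the request there.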
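
The handoff behaviour described above can be sketched roughly as follows. This is a minimal illustration and not Swift's proxy code: `nodes_for_put`, the node names, and the `is_up` callback are hypothetical, and the sketch assumes the ring has already supplied an ordered list of handoff candidates.

```python
def nodes_for_put(primaries, handoffs, is_up):
    """Pick the nodes a PUT should be sent to.

    Each unavailable primary is replaced by the next available handoff
    node. `is_up` is a hypothetical callable reporting node availability.
    """
    handoff_iter = iter(handoffs)
    targets = []
    for node in primaries:
        if is_up(node):
            targets.append(node)
            continue
        # Primary is down: take the next handoff candidate that is up.
        for candidate in handoff_iter:
            if is_up(candidate):
                targets.append(candidate)
                break
    return targets


down = {"node-b", "node-d"}
targets = nodes_for_put(
    primaries=["node-a", "node-b", "node-c"],
    handoffs=["node-d", "node-e"],
    is_up=lambda node: node not in down,
)
print(targets)
```

If the handoff list runs out, the sketch simply returns fewer targets; how a real proxy reacts in that case is outside what this overview describes.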

When objects are streamed to or from an object server, they are streamed
directly through the proxy server to or from the user -- the proxy server
does not spool them.
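The proxy server does not spool (buffer) the object data stream passing between the object server and the user (translator's note: the data still flows through the proxy, which relays it as it arrives rather than storing it first).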

The Ring
A ring represents a mapping between the names of entities stored on disk and
their physical location. There are separate rings for accounts, containers, and
objects. When other components need to perform any operation on an object,
container, or account, they need to interact with the appropriate ring to
determine its location in the cluster.
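A ring represents the mapping between the names of entities stored on disk and their physical locations. Accounts, containers, and objects each have their own ring. When another component needs to operate on an object, container, or account, it uses the corresponding ring to determine the item's location in the cluster.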

The Ring maintains this mapping using zones, devices, partitions, and replicas.
Each partition in the ring is replicated, by default, 3 times across the
cluster, and the locations for a partition are stored in the mapping maintained
by the ring. The ring is also responsible for determining which devices are
used for handoff in failure scenarios.
The ring uses zones, devices, partitions, and replicas to maintain this mapping. By default, each partition in the ring has 3 replicas across the cluster.
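
As a rough illustration of the mapping, the sketch below hashes an entity's path to a partition and then reads the devices for each replica from a table. It is a toy model under stated assumptions: the device list, the `PART_POWER` value, the table layout, and the function names are invented for the example, and details that a real ring would need (such as a cluster-specific hash salt) are left out.

```python
import hashlib
import struct

PART_POWER = 4                  # toy value: 2**4 = 16 partitions
PART_SHIFT = 32 - PART_POWER
REPLICAS = 3                    # default replica count from the text

# Hypothetical devices spread over three zones.
devices = [{"id": i, "zone": i % 3, "ip": f"10.0.0.{i + 1}"} for i in range(6)]

# replica2part2dev[r][p] is the id of the device holding replica r of
# partition p. The layout is chosen so each replica lands in a different zone.
replica2part2dev = [
    [(part + replica) % len(devices) for part in range(2 ** PART_POWER)]
    for replica in range(REPLICAS)
]


def get_partition(account, container=None, obj=None):
    """Map an entity name to a partition using the top bits of its MD5."""
    path = "/" + "/".join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT


def get_nodes(account, container=None, obj=None):
    """Return the partition and the devices holding its replicas."""
    part = get_partition(account, container, obj)
    return part, [devices[row[part]] for row in replica2part2dev]


part, nodes = get_nodes("AUTH_test", "photos", "cat.jpg")
print(part, [(n["ip"], n["zone"]) for n in nodes])
```

The same name always hashes to the same partition, so any component holding a copy of the table can locate the data without asking a central service.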

Data can be isolated with the concept of zones in the ring. Each replica
of a partition is guaranteed to reside in a different zone. A zone could
represent a drive, a server, a cabinet, a switch, or even a datacenter.

The partitions of the ring are equally divided among all the devices in the
Swift installation. When partitions need to be moved around (for example if a
device is added to the cluster), the ring ensures that a minimum number of
partitions are moved at a time, and only one replica of a partition is moved at
a time.

Weights can be used to balance the distribution of partitions on drives
across the cluster. This can be useful, for example, when different sized
drives are used in a cluster.
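
To make the effect of weights concrete, the following sketch computes how many partition replicas each device would ideally hold when assignments are proportional to weight. The function name and the device names are hypothetical, and the proportional rule is an assumption used for illustration rather than a description of the ring builder's actual algorithm.

```python
def desired_partitions(weights, partitions, replicas=3):
    """Ideal number of partition replicas per device, proportional to weight."""
    total_weight = sum(weights.values())
    total_assignments = partitions * replicas
    return {
        dev: total_assignments * weight / total_weight
        for dev, weight in weights.items()
    }


# A drive twice the size gets twice the weight, and so twice the partitions.
weights = {"sda": 1.0, "sdb": 1.0, "sdc": 2.0}
shares = desired_partitions(weights, partitions=1024)
print(shares)
```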

The ring is used by the Proxy server and several background processes
(like replication).
The ring is used by the proxy server and by some background processes (such as replication).

Object Server
The Object Server is a very simple blob storage server that can store,
retrieve and delete objects stored on local devices. Objects are stored
as binary files on the filesystem with metadata stored in the file's
extended attributes (xattrs). This requires that the underlying filesystem
choice for object servers support xattrs on files. Some filesystems,
like ext3, have xattrs turned off by default.
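The object server is a very simple blob storage server that can store, retrieve, and delete objects on local devices. Objects are stored as binary files on the filesystem, with their metadata kept in the files' extended attributes (xattrs). This implicitly requires the object server's filesystem to support xattrs. Some filesystems, such as ext3, have xattrs turned off by default.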

Each object is stored using a path derived from the object name's hash and
the operation's timestamp. Last write always wins, and ensures that the
latest object version will be served. A deletion is also treated as a
version of the file (a 0 byte file ending with ".ts", which stands for
tombstone). This ensures that deleted files are replicated correctly and
older versions don't magically reappear due to failure scenarios.
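Each object's storage path is built from the hash of the object name plus the timestamp of the operation. The last write always wins, which ensures that the latest version of the object is the one served. A deletion is also treated as a version of the file (a 0-byte file with the suffix .ts, for tombstone). This ensures that deletions are replicated correctly and that deleted files do not unexpectedly reappear after a failure.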
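
The "last write wins" rule and the role of tombstones can be sketched as below. The sketch assumes that every version of an object is a file named `<timestamp>.<suffix>` in one directory; the `.ts` suffix comes from the text, while the `.data` suffix, the timestamp format, and the `resolve` function are assumptions made for the example.

```python
def resolve(filenames):
    """Return (state, filename) for the newest version among the given files."""
    if not filenames:
        return "missing", None
    # The part before the final dot is the operation's timestamp.
    newest = max(filenames, key=lambda name: float(name.rsplit(".", 1)[0]))
    state = "deleted" if newest.endswith(".ts") else "present"
    return state, newest


# A write followed by a later delete: the tombstone is the newest version.
print(resolve(["1700000000.00000.data", "1700000050.00000.ts"]))

# A delete followed by a later write: the object is present again.
print(resolve(["1700000050.00000.ts", "1700000099.00000.data"]))
```

Because the tombstone is just another timestamped version, a stale copy of the data on a node that missed the delete still loses to it when the versions are compared.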

Container Server
The Container Server's primary job is to handle listings of objects. It
doesn't know where those objects are, just what objects are in a specific
container. The listings are stored as sqlite database files, and replicated
across the cluster similar to how objects are. Statistics are also tracked
that include the total number of objects, and total storage usage for that
container.
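The container server's main job is to handle object listings. It does not know where the objects are stored, only which objects are in a given container. The listings are stored as sqlite database files and are replicated across the cluster in the same way as objects. The container server also tracks statistics such as the number of objects and the container's storage usage.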
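
A container listing kept in sqlite might look roughly like the sketch below. The table layout, column names, and helper functions are hypothetical and much simpler than whatever schema the container server really uses; the point is only that a listing and its statistics are plain queries over one table per container.

```python
import sqlite3

# One in-memory database stands in for one container's database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER)")
conn.executemany(
    "INSERT INTO object (name, size) VALUES (?, ?)",
    [("z.bin", 300), ("a.txt", 100), ("b/c.jpg", 200)],
)


def listing(conn, limit=10000):
    """Return object names in sorted order, as a listing request would."""
    rows = conn.execute(
        "SELECT name FROM object ORDER BY name LIMIT ?", (limit,)
    )
    return [name for (name,) in rows]


def stats(conn):
    """Return the object count and total bytes used for the container."""
    count, used = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(size), 0) FROM object"
    ).fetchone()
    return {"object_count": count, "bytes_used": used}


print(listing(conn))
print(stats(conn))
```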

Account Server
The Account Server is very similar to the Container Server, except that
it is responsible for listings of containers rather than objects.
The account server is very similar to the container server, except that it handles listings of containers.

Replication
Replication is designed to keep the system in a consistent state in the face
of temporary error conditions like network outages or drive failures.
Replication is designed to keep data consistent when the system suffers faults such as network outages or drive failures.

The replication processes compare local data with each remote copy to ensure
they all contain the latest version. Object replication uses a hash list to
quickly compare subsections of each partition, and container and account
replication use a combination of hashes and shared high water marks.
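The replication processes compare local data with the remote copies to ensure they all hold the latest version. Object replication uses a hash list to quickly compare the subsections of each partition; container and account replication compare versions using hashes together with shared high water marks.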
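
The idea of comparing subsections by hash can be sketched as follows. The sketch assumes a partition is modelled as a dictionary from subsection name to the list of files it contains, and that a subsection's hash is the MD5 of its sorted file names; both the model and the function names are assumptions for illustration.

```python
import hashlib


def subsection_hashes(partition):
    """Map each subsection of a partition to a hash of its file names."""
    return {
        name: hashlib.md5("".join(sorted(files)).encode("utf-8")).hexdigest()
        for name, files in partition.items()
    }


def subsections_to_sync(local_hashes, remote_hashes):
    """Subsections whose local hash is missing or different on the remote."""
    return sorted(
        name
        for name, digest in local_hashes.items()
        if remote_hashes.get(name) != digest
    )


local = {"abc": ["1700000000.data"], "def": ["1700000050.data"], "fed": ["1.ts"]}
remote = {"abc": ["1700000000.data"], "def": ["1700000010.data"]}

stale = subsections_to_sync(subsection_hashes(local), subsection_hashes(remote))
print(stale)
```

Only the hash lists need to be exchanged to find the differences, so identical subsections cost almost nothing to verify.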

Replication updates are push based. For object replication, updating is
just a matter of rsyncing files to the peer. Account and container
replication push missing records over HTTP or rsync whole database files.
Replication updates are push based. For object replication, updating means rsyncing files to the peer nodes. Account and container replication either push the missing records over HTTP or rsync the whole database file.

The replicator also ensures that data is removed from the system. When an
item (object, container, or account) is deleted, a tombstone is set as the
latest version of the item. The replicator will see the tombstone and ensure
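that the item is removed from the entire system.
The replicator also ensures that deleted data is really removed from the system. When an item (object, container, or account) is deleted, a tombstone is set as its latest version (translator's note: the item may not be physically deleted yet, only marked as deleted). The replicator sees the tombstone and ensures that the item is removed from the entire system.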

Updaters
There are times when container or account data cannot be immediately
updated. This usually occurs during failure scenarios or periods of high
load. If an update fails, the update is queued locally on the filesystem,
and the updater will process the failed updates. This is where an eventual
consistency window will most likely come into play. For example, suppose a
container server is under load and a new object is put into the system. The
object will be immediately available for reads as soon as the proxy server
responds to the client with success. However, the container server did not
update the object listing, and so the update would be queued for a later
update. Container listings, therefore, may not immediately contain the object.
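There are times when container and account data cannot be updated immediately, usually during failures or under heavy load. If an update fails, it is queued on the local filesystem (translator's note: judging from the context, the update itself is usually sent to a remote server), and the updater later retries it. This is where the eventual consistency window comes into play. For example, suppose a container server is under heavy load when a new object is put into the system. Once the proxy server reports success to the client, the object is immediately readable. The container server, however, has not yet updated its object listing, so the update is queued for later processing. The container listing may therefore not contain the new object right away.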
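
The queue-and-retry behaviour can be sketched as below. Everything here is hypothetical: the JSON file format, the function names, and the `send` callback that stands in for the request to a container server. The sketch only shows the shape of the mechanism, namely that a failed update is saved on the local filesystem and replayed later.

```python
import json
import os
import tempfile


def send_or_queue(update, send, queue_dir):
    """Try to deliver an update; on failure, save it locally for later."""
    try:
        send(update)
        return True
    except OSError:
        fd, _ = tempfile.mkstemp(suffix=".json", dir=queue_dir)
        with os.fdopen(fd, "w") as fh:
            json.dump(update, fh)
        return False


def run_updater(send, queue_dir):
    """Replay queued updates; keep any that still fail for the next run."""
    delivered = 0
    for name in sorted(os.listdir(queue_dir)):
        path = os.path.join(queue_dir, name)
        with open(path) as fh:
            update = json.load(fh)
        try:
            send(update)
        except OSError:
            continue
        os.remove(path)
        delivered += 1
    return delivered


def busy_server(update):
    # Stand-in for a container server that is under load.
    raise ConnectionError("container server is busy")


queue_dir = tempfile.mkdtemp()
container_listing = []

# The object PUT succeeds, but the listing update has to be queued.
queued = not send_or_queue({"object": "cat.jpg"}, busy_server, queue_dir)
listing_during_window = list(container_listing)

# Later the updater runs and the listing catches up.
replayed = run_updater(container_listing.append, queue_dir)
print(queued, listing_during_window, replayed, container_listing)
```

The time between the failed update and the updater run is the consistency window: the object exists, but `listing_during_window` does not show it yet.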

In practice, the consistency window is only as long as the interval between
updater runs, and it may not even be noticed, because the proxy server
routes listing requests to the first container server which responds. The
server under load may not be the one that serves subsequent listing
requests -- one of the other two replicas may handle the listing.
In practice, the size of the consistency window matches how often the updater runs, and it may go unnoticed because the proxy server routes listing requests to the first container server that responds. The server under load need not be the one that answers later listing requests; one of the other two replicas (with the default of three) may handle them.

Auditors
Auditors crawl the local server checking the integrity of the objects,
containers, and accounts. If corruption is found (in the case of bit rot,
for example), the file is quarantined, and replication will replace the bad
file from another replica. If other errors are found they are logged (for
example, an object's listing can't be found on any container server it
should be).
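Auditors repeatedly crawl the local server to verify the integrity of objects, containers, and accounts. When corrupted data is found (even a bit-level difference), the file is quarantined and replication replaces the bad file from another replica. Other errors (for example, an object that cannot be found in the listing of any container server that should have it) are written to the log.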
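
One way an audit pass could detect corruption is by comparing a file's checksum with a value recorded when it was written. The text does not say how corruption is detected, so the checksum comparison, the MD5 choice, the directory layout, and the function names below are all assumptions made for this sketch.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def audit(path, expected_md5, quarantine_dir):
    """Return True if the file is intact; otherwise quarantine it."""
    if md5_of(path) == expected_md5:
        return True
    # Move the bad file aside so replication can restore a good copy.
    os.makedirs(quarantine_dir, exist_ok=True)
    shutil.move(path, os.path.join(quarantine_dir, os.path.basename(path)))
    return False


workdir = tempfile.mkdtemp()
quarantine_dir = os.path.join(workdir, "quarantined")
obj_path = os.path.join(workdir, "object.data")

with open(obj_path, "wb") as fh:
    fh.write(b"hello swift")
expected = md5_of(obj_path)
ok_before = audit(obj_path, expected, quarantine_dir)

# Simulate bit rot by changing one byte on disk.
with open(obj_path, "wb") as fh:
    fh.write(b"hellp swift")
ok_after = audit(obj_path, expected, quarantine_dir)
print(ok_before, ok_after)
```

Once the file has been moved out of its normal location, the local copy is simply missing from the replicator's point of view, which is what lets replication replace it from another replica.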