Translators are one of the core concepts of GlusterFS, and are considered by some to be the feature most clearly differentiating it from otherwise similar parallel filesystems. It is therefore highly unfortunate that the authors have not seen fit to share any information about the interface that every translator needs to use. This API is broadly similar to that from FUSE[1], and readers should be familiar with that API, but Gluster's implementation has many of its own quirks. For example, all translator functions must use an asynchronous calling convention reminiscent of SEDA[2] or STREAMS from AT&T UNIX; the required call-frame and call-stack structures are functionally almost identical to I/O Request Packets and IRP stacks in Windows NT. Many of the translator functions start out resembling their FUSE counterparts, but with extra or modified arguments. These matters all require some explanation, as do some of the facilities available to translator hackers, and this document is my attempt to provide that explanation based on my own exploration of the GlusterFS 3.0 code.
1.1 Call Frames and Call Stacks
These are some of the most fundamental structures in the translator API, defined in stack.h as call_frame_t and call_stack_t respectively. The translator environment can be thought of as a thread library, with each request getting its own very lightweight thread. This thread includes its own stack, but these call stacks are more structured than the C stack and occasionally depart from the strict LIFO model of a stack. Each call frame represents a function call, and function calls may be nested, but these translator-level function calls are actually two calls in C: an initial part that executes (and returns) before making any nested calls, and a final part that executes after such nested calls complete. The call_frame_t fields to support this are as follows.
.root points to a dummy frame embedded in the call_stack_t (unlike all the others, which are allocated separately) and is the same for all frames within the stack.
.parent points to the frame that invoked this one. This would be the calling function in C terms.
.next and prev link to the next and previous frames to be completed (not invoked, which would be the opposite order). See below for an explanation of why neither of these is equivalent to parent.
.local contains private data for this frame invocation. These are local variables to the translator function, stored here to persist despite the splitting of that function into two C functions with nested translator calls in between.
.this points to the translator's private “global” state. Since there might be multiple instances of a translator, this is actually more similar to class members in C++ than to true globals.
.ret points to the second half of the caller's translator function, to be executed when this nested call completes[3].
.ref_count and complete are used by the scheduler to keep track of which frames are active, which have completed, and which need to be resumed. See the section on STACK_WIND and STACK_UNWIND for more details.
.lock and cookie I haven't completely figured out yet (TBD).
Under normal conditions, a new frame is pushed onto the stack so that next is the same as parent. However, a frame can also branch. For example, the frame for stripe_writev creates one new frame for each of its subvolumes before it returns. With three subvolumes, next would be the same as parent only for the first new frame.
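To make the two-part function idea more concrete, here is a minimal sketch in plain C. It does not use the real definitions from stack.h; sketch_frame_t, upper_fn, lower_fn and sketch_run are all hypothetical names of mine, and only the parent, local and ret fields described above are modeled. The nested call also completes immediately here, which a real translator cannot assume.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for call_frame_t.
 * Only the fields discussed above are modeled. */
typedef struct sketch_frame sketch_frame_t;
typedef void (*sketch_ret_fn)(sketch_frame_t *caller_frame, int op_ret);

struct sketch_frame {
    sketch_frame_t *parent;  /* frame that invoked this one */
    void           *local;   /* per-invocation "local variables" */
    sketch_ret_fn   ret;     /* second half of the caller's function */
};

/* "Local variables" of the upper function, kept alive between halves. */
typedef struct {
    int requested;
    int result;
} upper_local_t;

/* Lowest level: completes immediately and resumes its caller. */
static void lower_fn(sketch_frame_t *frame, int value)
{
    sketch_frame_t *caller = frame->parent;
    sketch_ret_fn   resume = frame->ret;

    free(frame);                /* the nested frame is finished */
    resume(caller, value * 2);  /* run the caller's second half */
}

/* Second half of the upper function: runs after the nested call completes. */
static void upper_fn_cbk(sketch_frame_t *frame, int op_ret)
{
    upper_local_t *local = frame->local;

    local->result = local->requested + op_ret;
}

/* First half of the upper function: saves state, makes the nested call,
 * and returns without depending on C locals surviving. */
static void upper_fn(sketch_frame_t *frame, int value)
{
    upper_local_t  *local = frame->local;
    sketch_frame_t *child = calloc(1, sizeof(*child));

    assert(child != NULL);
    local->requested = value;
    child->parent = frame;         /* callee remembers its caller ... */
    child->ret    = upper_fn_cbk;  /* ... and the caller's second half */
    lower_fn(child, value);
}

/* Small driver: returns the value computed across both halves. */
int sketch_run(int value)
{
    upper_local_t  local = { 0, 0 };
    sketch_frame_t top   = { NULL, &local, NULL };

    upper_fn(&top, value);
    return local.result;
}
```

Note how the pointer to the caller's second half lives in the callee's frame, and how the value saved in local by the first half is still there when the second half runs.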
Following the thread-library model, the call_stack_t corresponds to a thread control block. Some of its useful fields include these.
.uid and gid identify who the call came from, and can be used for authentication or authorization.
.pid identifies the process that initiated the request. I'm not sure what this is good for, but it's there.
.trans points to the transport structure associated with the request. This can be used on the server to implement node-based access control, or to track down requests on failed connections; I'm not sure how useful it is (or even if it's valid) on the client side.
.frames is a dummy frame, which points to the first real frame for the request.
.Either op or type identifies what type of request this is, but it's not clear which.
There are other fields as well, but I don't know what they're for (TBD). Relatedly, I have no idea what a call_pool_t is for (TBD).
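As an illustration of the authorization use mentioned for uid and gid, here is a small sketch. Everything in it is an assumption of mine: sketch_stack_t stands in for just the identity fields of call_stack_t, sketch_owner_t is a made-up ownership record, and the policy (uid 0 may do anything, otherwise owner or group match) is only one plausible choice, not something taken from the GlusterFS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the identity fields of call_stack_t. */
typedef struct {
    unsigned int uid;  /* user the request came from */
    unsigned int gid;  /* group the request came from */
    int          pid;  /* process that initiated the request */
} sketch_stack_t;

/* Hypothetical ownership record for the object being modified. */
typedef struct {
    unsigned int owner_uid;
    unsigned int owner_gid;
    bool         group_writable;
} sketch_owner_t;

/* Decide whether the caller identified by the stack may modify the object.
 * The policy here is an assumption chosen for illustration only. */
bool sketch_may_modify(const sketch_stack_t *stack, const sketch_owner_t *owner)
{
    assert(stack != NULL && owner != NULL);

    if (stack->uid == 0)
        return true;                      /* treat uid 0 as superuser */
    if (stack->uid == owner->owner_uid)
        return true;                      /* the owner may always modify */
    return owner->group_writable && stack->gid == owner->owner_gid;
}
```

A translator function would make a check like this in its initial half and complete the request with an error instead of making the nested call when the check fails.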
1.2 STACK_WIND and STACK_UNWIND
.frame is a pointer to your own call_frame_t.
.obj is a pointer to the translator instance (e.g. one per subvolume) for the nested call.
1.3 Inode and File Descriptor Context
In addition to storing context in a frame's local field, there are methods for attaching a limited amount of extra information to an inode_t or fd_t. This can be handy for information produced or consumed in utility functions that already have pointers to these objects but not to the frame. These are simple maps/dictionaries, where the key is a pointer to the translator and the value is an unsigned 64-bit integer. Thus, for an fd we have:
.fd_ctx_set (fd_t *, xlator_t *, uint64_t)
.fd_ctx_get (fd_t *, xlator_t *, uint64_t *)
.fd_ctx_del (fd_t *, xlator_t *, uint64_t *)[4]
There is a similar set of functions for inodes, plus functions (e.g. inode_ctx_set2) to get/set two values instead of just one.
It's important to keep in mind that an fd_t or inode_t has only a limited number of these context slots, and only one may be used by each translator. Therefore, there are many subtle opportunities for the table to overflow, or for code in one part of a complex translator to overwrite the values needed by another[5], etc. There also seems to be a bit of a bug in the implementation of these functions, in that they use a loop index bounded by xlator->ctx->xl_count as an index into inode->_ctx even though the bound could exceed the array size. I think this works because xl_count never happens to be set that high and these features aren't used everywhere, but it still seems very brittle (TBD).
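The behavior described above (one slot per translator instance, a fixed number of slots, and a delete that hands back the old value) can be sketched as follows. This is not the GlusterFS implementation: the sketch_ types and functions are hypothetical, the slot count of 4 is arbitrary, and the return convention (0 on success, -1 on failure) is my assumption. Unlike the loop I complained about, this one is bounded by the array size itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CTX_SLOTS 4  /* arbitrary small limit, for illustration */

/* Hypothetical stand-ins for xlator_t and fd_t. */
typedef struct { const char *name; } sketch_xlator_t;

typedef struct {
    const sketch_xlator_t *key;   /* translator instance owning the slot */
    uint64_t               value;
} sketch_ctx_slot_t;

typedef struct {
    sketch_ctx_slot_t ctx[SKETCH_CTX_SLOTS];
} sketch_fd_t;

/* Store a value for this translator instance, overwriting any earlier one.
 * Returns 0 on success, -1 if every slot belongs to another translator. */
int sketch_fd_ctx_set(sketch_fd_t *fd, const sketch_xlator_t *xl, uint64_t value)
{
    sketch_ctx_slot_t *free_slot = NULL;
    size_t i;

    assert(fd != NULL && xl != NULL);
    for (i = 0; i < SKETCH_CTX_SLOTS; i++) {
        if (fd->ctx[i].key == xl) {
            fd->ctx[i].value = value;     /* same key: silently overwritten */
            return 0;
        }
        if (fd->ctx[i].key == NULL && free_slot == NULL)
            free_slot = &fd->ctx[i];
    }
    if (free_slot == NULL)
        return -1;                        /* table overflow */
    free_slot->key = xl;
    free_slot->value = value;
    return 0;
}

/* Fetch the value stored for this translator instance, if any. */
int sketch_fd_ctx_get(const sketch_fd_t *fd, const sketch_xlator_t *xl, uint64_t *value)
{
    size_t i;

    assert(fd != NULL && xl != NULL && value != NULL);
    for (i = 0; i < SKETCH_CTX_SLOTS; i++) {
        if (fd->ctx[i].key == xl) {
            *value = fd->ctx[i].value;
            return 0;
        }
    }
    return -1;
}

/* Destructive fetch: return the old value and free the slot. */
int sketch_fd_ctx_del(sketch_fd_t *fd, const sketch_xlator_t *xl, uint64_t *value)
{
    size_t i;

    assert(fd != NULL && xl != NULL);
    for (i = 0; i < SKETCH_CTX_SLOTS; i++) {
        if (fd->ctx[i].key == xl) {
            if (value != NULL)
                *value = fd->ctx[i].value;
            fd->ctx[i].key = NULL;
            fd->ctx[i].value = 0;
            return 0;
        }
    }
    return -1;
}
```

Since the value is only 64 bits wide, anything larger has to be allocated separately and stored as a pointer cast to uint64_t, which is exactly when the destructive fetch in the delete function becomes useful: it returns the pointer that needs to be freed.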
Calling Conventions
STACK_WIND and STACK_UNWIND both enforce a certain consistency in the calling conventions for translator functions, though some variation is also possible. Initial functions (e.g. foo_writev) all take the same first two arguments:
.frame is a pointer to the current call frame.
.this is a pointer to the current translator. It's not clear how this could ever be anything other than frame->this (TBD).
For completion functions (e.g. foo_writev_cbk) the convention is a bit different, with four common parameters instead of two.
.frame is the same as for the initial function, pointing to the frame that's being resumed.
.cookie is usually the same if STACK_WIND was used, though it could be different if the STACK_WIND_COOKIE variant was used. It's extracted from the frame and passed on by STACK_UNWIND, and might be another place to store some request-specific data.
.this is also the same as for the initial function, pointing to the translator whose frame is being resumed.
.op_ret is the return value from the nested call. There is also op_errno, presumably mirroring the usage of return values and errno for system or libc calls.
Each individual function might also take additional parameters, varying in each case, so check xlator.h for specific details.
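The two parameter conventions can be sketched side by side. This is a self-contained mock, not code against the real headers: the sketch_ types, child_writev, sketch_write and the SKETCH_MAX_WRITE limit are hypothetical, a single frame is reused where the real code would create a new one for the nested call, and the completion function is invoked directly instead of through STACK_WIND and STACK_UNWIND. The cookie here is a caller-chosen tag, which is just one guess at how it might be used.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_WRITE 4096  /* hypothetical limit enforced by the child */

typedef struct sketch_xlator sketch_xlator_t;
typedef struct sketch_frame  sketch_frame_t;

/* Completion functions: frame, cookie, this, op_ret, then the rest. */
typedef int (*sketch_cbk_fn)(sketch_frame_t *frame, void *cookie,
                             sketch_xlator_t *this, int op_ret, int op_errno);

struct sketch_xlator {
    const char      *name;
    sketch_xlator_t *child;   /* single subvolume */
};

struct sketch_frame {
    sketch_xlator_t *this;    /* translator that owns this frame */
    void            *local;   /* per-request state */
    void            *cookie;  /* passed back to the completion function */
    sketch_cbk_fn    ret;     /* completion function to resume */
};

typedef struct {
    int last_ret;
    int last_errno;
} foo_local_t;

/* Completion half: records the outcome of the nested call. */
static int foo_writev_cbk(sketch_frame_t *frame, void *cookie,
                          sketch_xlator_t *this, int op_ret, int op_errno)
{
    foo_local_t *local = frame->local;

    (void)cookie;
    (void)this;
    local->last_ret   = op_ret;
    local->last_errno = (op_ret < 0) ? op_errno : 0;
    return 0;
}

/* Mock child: "writes" len bytes, or fails with EFBIG above the limit. */
static int child_writev(sketch_frame_t *frame, sketch_xlator_t *this, size_t len)
{
    (void)this;
    if (len > SKETCH_MAX_WRITE)
        return frame->ret(frame, frame->cookie, frame->this, -1, EFBIG);
    return frame->ret(frame, frame->cookie, frame->this, (int)len, 0);
}

/* Initial half: frame and this come first, call-specific arguments after. */
static int foo_writev(sketch_frame_t *frame, sketch_xlator_t *this, size_t len)
{
    frame->ret    = foo_writev_cbk;
    frame->cookie = this->child;  /* tag the request with the subvolume used */
    return child_writev(frame, this->child, len);
}

/* Driver: returns op_ret and stores op_errno through the out parameter. */
int sketch_write(size_t len, int *op_errno)
{
    sketch_xlator_t child = { "child", NULL };
    sketch_xlator_t foo   = { "foo", &child };
    foo_local_t     local = { 0, 0 };
    sketch_frame_t  frame = { &foo, &local, NULL, NULL };

    foo_writev(&frame, &foo, len);
    *op_errno = local.last_errno;
    return local.last_ret;
}
```

As with system calls, op_errno is only meaningful when op_ret indicates failure, which is why the completion function above ignores it otherwise.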
[2] http://www.eecs.harvard.edu/~mdw/proj/seda/
[3] Yes, it's confusing to put a pointer to the caller's function in the callee's frame, but that's the way it is.
[4] Yes, the delete function fetches the value first, making it more of a destructive fetch than a pure delete.
[5] Note that the key is a translator instance rather than a translator type, so at least multiple instances of the same translator won't step on each other.