1 GlusterFS Translator API

 

Translators are one of the core concepts of GlusterFS, and are considered by some to be the feature most clearly differentiating it from otherwise similar parallel filesystems.  It is therefore highly unfortunate that the authors have not seen fit to share any information about the interface that every translator needs to use.  This API is broadly similar to that from FUSE [1], and readers should be familiar with that API, but Gluster's implementation has many of its own quirks.  For example, all translator functions must use an asynchronous calling convention reminiscent of SEDA [2] or STREAMS from AT&T UNIX; the required call-frame and call-stack structures are functionally almost identical to I/O Request Packets and IRP stacks in Windows NT.  Many of the translator functions start out resembling their FUSE counterparts, but with extra or modified arguments.  These matters all require some explanation, as do some of the facilities available to translator hackers, and this document is my attempt to provide that explanation based on my own exploration of the GlusterFS 3.0 code.

 

1.1   Call Frames and Call Stacks

 

These are some of the most fundamental structures in the translator API, defined in stack.h as call_frame_t and call_stack_t respectively.  The translator environment can be thought of as a thread library, with each request getting its own very lightweight thread.  This thread includes its own stack, but these call stacks are more structured than the C stack and occasionally depart from the strict LIFO model of a stack.  Each call frame represents a function call, and function calls may be nested, but these translator-level function calls are actually two calls in C - an initial part that executes (and returns) before making any nested calls, and a final part that executes after such nested calls complete.  The call_frame_t fields to support this are as follows.

 

- root points to a dummy frame embedded in the call_stack_t (unlike all the others, which are allocated separately) and is the same for all frames within the stack.

- parent points to the frame that invoked this one.  This would be the calling function in C terms.

- next and prev link to the next and previous frames to be completed (not invoked, which would be the opposite order).  See below for an explanation of why neither of these is equivalent to parent.

- local contains private data for this frame invocation.  These are local variables to the translator function, stored here to persist despite the splitting of that function into two C functions with nested translator calls in between.

- this points to the translator's private “global” state.  Since there might be multiple instances of a translator, this is actually more similar to class members in C++ than to true globals.

- ret points to the second half of the caller's translator function, to be executed when this nested call completes [3].

- ref_count and complete are used by the scheduler to keep track of which frames are active, which have completed, and which need to be resumed.  See the section on STACK_WIND and STACK_UNWIND for more details.

- lock and cookie I haven't completely figured out yet (TBD).

 

Under normal conditions, a new frame is pushed onto the stack so that its next is the same as its parent.  However, a frame can also branch.  For example, the frame for stripe_writev creates one new frame for each of its subvolumes before it returns.  With three subvolumes, next would be the same as parent only for the first new frame.
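
The difference between parent and next in the branching case can be sketched with a small self-contained model.  The types toy_frame and toy_stack and the helpers toy_stack_init and toy_push are hypothetical stand-ins, not the real definitions from stack.h; only the field names come from the list above.  The sketch assumes that each new frame is linked in directly after the dummy frame, which reproduces the behavior just described but is not copied from the GlusterFS source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for call_frame_t and call_stack_t. */
struct toy_stack;

struct toy_frame {
    struct toy_stack *root;    /* owning stack, same for every frame */
    struct toy_frame *parent;  /* frame that invoked this one */
    struct toy_frame *next;    /* next frame in the completion list */
    struct toy_frame *prev;    /* previous frame in the completion list */
};

struct toy_stack {
    struct toy_frame frames;   /* dummy frame embedded in the stack */
};

/* Set up a stack whose only real frame is the one for the request. */
static void toy_stack_init(struct toy_stack *stack, struct toy_frame *first)
{
    stack->frames.root = stack;
    stack->frames.parent = NULL;
    stack->frames.prev = NULL;
    stack->frames.next = first;
    first->root = stack;
    first->parent = NULL;
    first->prev = &stack->frames;
    first->next = NULL;
}

/* Link a child frame right after the dummy frame (assumed insertion point). */
static void toy_push(struct toy_frame *parent, struct toy_frame *child)
{
    struct toy_stack *stack = parent->root;

    assert(stack != NULL);
    child->root = stack;
    child->parent = parent;
    child->next = stack->frames.next;
    child->prev = &stack->frames;
    if (stack->frames.next != NULL)
        stack->frames.next->prev = child;
    stack->frames.next = child;
}

/* Branch three children off one frame; count those with next == parent. */
static int toy_branch_demo(void)
{
    struct toy_stack stack;
    struct toy_frame stripe, child[3];
    int i, same = 0;

    toy_stack_init(&stack, &stripe);
    for (i = 0; i < 3; i++)
        toy_push(&stripe, &child[i]);
    for (i = 0; i < 3; i++)
        if (child[i].next == child[i].parent)
            same++;
    return same;
}
```

In this model only the first child has next equal to parent; the second and third point at the sibling created just before them.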

 

Following the thread-library model, the call_stack_t corresponds to a thread control block.  Some of its useful fields include these.

 

- uid and gid identify who the call came from, and can be used for authentication or authorization.

- pid identifies the process that initiated the request.  I'm not sure what this is good for, but it's there.

- trans points to the transport structure associated with the request.  This can be used on the server to implement node-based access control, or to track down requests on failed connections; I'm not sure how useful it is (or even if it's valid) on the client side.

- frames is a dummy frame, which points to the first real frame for the request.

- Either op or type identifies what kind of request this is, but it's not clear which.

 

There are other fields as well, but I don't know what they're for (TBD).  Relatedly, I have no idea what a call_pool_t is for (TBD).

 

1.2   STACK_WIND and STACK_UNWIND

 

The API equivalents of a function call and return, respectively, are STACK_WIND and STACK_UNWIND.  (In the frame-branching case, STACK_WIND might be more like a coroutine invocation.)  If you're in a translator function and want to pass a request on to the next translator, you call STACK_WIND with the following parameters.

 

- frame is a pointer to your own call_frame_t.

- rfn is a pointer to your own “second half” completion function, which will be invoked after the nested call completes.

- obj is a pointer to the translator instance (e.g. one per subvolume) for the nested call.

- fn is a pointer to the specific translator function you're invoking (usually taken from the translator instance's fops dispatch table).

 

At this point you (the calling translator function) can return.  It seems that translator functions (both initial and final) always return zero.  I'm not sure what the significance is of returning a non-zero value (TBD).  Execution of your function will resume at rfn.  Note that rfn might be invoked multiple times in the frame-branching case.  Also, you can store your “local variables” in a structure pointed to by your own frame's local field.  There are other ways (described below) to store context for later as well.
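
One way to picture the role of local when rfn is invoked several times is the sketch below.  Everything in it is hypothetical (toy_frame, toy_local, toy_writev, toy_writev_cbk); the nested calls are replaced by direct calls to the completion function, so it only illustrates how state stored in local survives between the first half and repeated invocations of the second half, not how GlusterFS actually schedules them.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a call frame; only local matters here. */
struct toy_frame {
    void *local;   /* per-invocation private data, as with frame->local */
};

/* Hypothetical "local variables" for a branching write. */
struct toy_local {
    int pending;      /* nested calls that have not completed yet */
    int total_bytes;  /* sum of the results seen so far */
    int done;         /* set once every nested call has completed */
};

/* Second half: runs once per completed nested call. */
static int toy_writev_cbk(struct toy_frame *frame, int op_ret)
{
    struct toy_local *local = frame->local;

    assert(local != NULL);
    local->total_bytes += op_ret;
    if (--local->pending == 0)
        local->done = 1;   /* last branch: safe to finish the request */
    return 0;
}

/* First half: stash state in local, then start one call per subvolume. */
static int toy_writev(struct toy_frame *frame, int subvolumes, int bytes_each)
{
    struct toy_local *local = calloc(1, sizeof(*local));
    int i;

    if (local == NULL)
        return -1;
    local->pending = subvolumes;
    frame->local = local;
    for (i = 0; i < subvolumes; i++)
        toy_writev_cbk(frame, bytes_each);  /* stand-in for a nested call */
    return 0;
}

/* Run a three-way branch and report the accumulated byte count. */
static int toy_local_demo(void)
{
    struct toy_frame frame = { NULL };
    struct toy_local *local;
    int total;

    if (toy_writev(&frame, 3, 100) != 0)
        return -1;
    local = frame.local;
    total = local->done ? local->total_bytes : -1;
    free(local);
    return total;
}
```

The counter in the local structure is what lets the completion function tell the last of the three invocations apart from the first two.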

 

To return from a translator call - not just the initial/final C function - you call STACK_UNWIND with a pointer to the frame you're completing (i.e. your own) and any additional parameters that should be passed to the calling frame's completion function.  There are also other variants of STACK_WIND and STACK_UNWIND whose purposes are unclear (TBD).  The way this all works is best illustrated by an example.

 

1. Let's say we're in foo_writev, and we want to pass a request on to the writev function of a subvolume bar.  Thus, we might do this: STACK_WIND(foo_frame, foo_writev_cbk, subvolume, subvolume->fops->writev, ...)

2. Within STACK_WIND a new frame is created, populated with some of the macro values, and pushed onto our call stack.  It also increments the ref_count field in foo_frame.  Lastly, STACK_WIND calls bar_writev.

3. For the sake of simplicity, we'll say that bar_writev doesn't need to make any nested calls.  It does what it needs to, and then does something like this: STACK_UNWIND(bar_frame, 0, ...)

4. STACK_UNWIND sets bar_frame->complete and decrements bar_frame->parent->ref_count (i.e. foo_frame->ref_count).

5. STACK_UNWIND finds the parent frame and completion-callback pointers in bar_frame, so it can call foo_writev_cbk(foo_frame, ...)

6. There's nothing in STACK_UNWIND to handle freeing frames.  I assume these are reaped when control returns to the scheduler, based on complete being set and ref_count being zero (TBD).
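
The walkthrough above can be modeled in a few lines of plain C.  This is a sketch under stated assumptions, not the real macros: toy_wind and toy_unwind are hypothetical functions, the caller supplies storage for the new frame instead of having it allocated inside the wind step, and the only bookkeeping shown is the ref_count and complete handling from steps 2, 4 and 5.

```c
#include <assert.h>
#include <stddef.h>

struct toy_frame;
typedef int (*toy_ret_fn)(struct toy_frame *frame, int op_ret);
typedef int (*toy_fop_fn)(struct toy_frame *frame);

/* Hypothetical, simplified stand-in for call_frame_t. */
struct toy_frame {
    struct toy_frame *parent;  /* frame that invoked this one */
    toy_ret_fn ret;            /* caller's completion function */
    int ref_count;             /* nested calls still outstanding */
    int complete;              /* set when this frame has unwound */
    int result;                /* demo only: value seen by the callback */
};

/* Model of step 2: set up the child, bump ref_count, call the function. */
static int toy_wind(struct toy_frame *frame, struct toy_frame *child,
                    toy_ret_fn rfn, toy_fop_fn fn)
{
    child->parent = frame;
    child->ret = rfn;
    child->ref_count = 0;
    child->complete = 0;
    frame->ref_count++;
    return fn(child);
}

/* Model of steps 4 and 5: mark complete, drop ref_count, call back. */
static int toy_unwind(struct toy_frame *frame, int op_ret)
{
    struct toy_frame *parent = frame->parent;

    assert(parent != NULL);
    frame->complete = 1;
    parent->ref_count--;
    return frame->ret(parent, op_ret);
}

/* Step 3: bar_writev makes no nested calls and unwinds right away. */
static int bar_writev(struct toy_frame *bar_frame)
{
    return toy_unwind(bar_frame, 42);
}

/* Second half of foo_writev, resumed when bar_writev unwinds. */
static int foo_writev_cbk(struct toy_frame *foo_frame, int op_ret)
{
    foo_frame->result = op_ret;
    return 0;
}

/* Step 1: first half of foo_writev passes the request down. */
static int foo_writev(struct toy_frame *foo_frame, struct toy_frame *bar_frame)
{
    return toy_wind(foo_frame, bar_frame, foo_writev_cbk, bar_writev);
}

/* Run the whole sequence and return what the callback received. */
static int toy_wind_demo(void)
{
    struct toy_frame foo = { NULL, NULL, 0, 0, -1 };
    struct toy_frame bar;

    foo_writev(&foo, &bar);
    return (bar.complete == 1 && foo.ref_count == 0) ? foo.result : -1;
}
```

Note that in this model, as in the description, the callback runs from inside the unwind step, so foo_writev_cbk finishes before the original call to foo_writev returns.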

 

1.3   Inode and File Descriptor Context

 

In addition to storing context in a frame's local field, there are methods for attaching a limited amount of extra information to an inode_t or fd_t.  This can be handy for information produced or consumed in utility functions that already have pointers to these objects but not to the frame.  These are simple maps/dictionaries, where the key is a pointer to the translator and the value is an unsigned 64-bit integer.  Thus, for an fd we have:

 

- fd_ctx_set (fd_t *, xlator_t *, uint64_t)

- fd_ctx_get (fd_t *, xlator_t *, uint64_t *)

- fd_ctx_del (fd_t *, xlator_t *, uint64_t *) [4]

 

There is a similar set of functions for inodes, plus functions (e.g. inode_ctx_set2) to get/set two values instead of just one.
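
As a rough mental model of these context functions, here is a toy version of a per-fd table keyed by translator pointer.  The names toy_fd, toy_xlator, toy_fd_ctx_set, toy_fd_ctx_get and toy_fd_ctx_del are hypothetical, the slot count of four is an arbitrary assumption, and the return values (0 for success, -1 for failure) are chosen for the sketch rather than taken from the real functions.  The argument shapes mirror the prototypes listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_CTX_SLOTS 4   /* hypothetical limit, not the real one */

struct toy_xlator { const char *name; };

struct toy_ctx_slot {
    struct toy_xlator *key;   /* translator instance owning this slot */
    uint64_t value;
};

struct toy_fd { struct toy_ctx_slot ctx[TOY_CTX_SLOTS]; };

/* Return the slot owned by xl, or NULL if it has none. */
static struct toy_ctx_slot *toy_find(struct toy_fd *fd, struct toy_xlator *xl)
{
    int i;

    assert(xl != NULL);
    for (i = 0; i < TOY_CTX_SLOTS; i++)
        if (fd->ctx[i].key == xl)
            return &fd->ctx[i];
    return NULL;
}

/* Store a value, reusing xl's slot if present, else the first free one. */
static int toy_fd_ctx_set(struct toy_fd *fd, struct toy_xlator *xl, uint64_t value)
{
    struct toy_ctx_slot *slot = toy_find(fd, xl);
    int i;

    for (i = 0; slot == NULL && i < TOY_CTX_SLOTS; i++)
        if (fd->ctx[i].key == NULL)
            slot = &fd->ctx[i];
    if (slot == NULL)
        return -1;   /* table is full */
    slot->key = xl;
    slot->value = value;
    return 0;
}

/* Fetch xl's value without changing the table. */
static int toy_fd_ctx_get(struct toy_fd *fd, struct toy_xlator *xl, uint64_t *value)
{
    struct toy_ctx_slot *slot = toy_find(fd, xl);

    if (slot == NULL)
        return -1;
    *value = slot->value;
    return 0;
}

/* Destructive fetch: hand back xl's value and release its slot. */
static int toy_fd_ctx_del(struct toy_fd *fd, struct toy_xlator *xl, uint64_t *value)
{
    struct toy_ctx_slot *slot = toy_find(fd, xl);

    if (slot == NULL)
        return -1;
    *value = slot->value;
    slot->key = NULL;
    slot->value = 0;
    return 0;
}

/* Exercise set, overwrite, delete and get; return 1 if all behave. */
static int toy_ctx_demo(void)
{
    struct toy_xlator a = { "a" }, b = { "b" };
    struct toy_fd fd;
    uint64_t v = 0;

    memset(&fd, 0, sizeof(fd));
    if (toy_fd_ctx_set(&fd, &a, 7) != 0) return 0;
    if (toy_fd_ctx_set(&fd, &b, 9) != 0) return 0;
    if (toy_fd_ctx_set(&fd, &a, 8) != 0) return 0;            /* overwrites 7 */
    if (toy_fd_ctx_del(&fd, &a, &v) != 0 || v != 8) return 0;
    if (toy_fd_ctx_get(&fd, &a, &v) == 0) return 0;           /* slot is gone */
    if (toy_fd_ctx_get(&fd, &b, &v) != 0 || v != 9) return 0;
    return 1;
}
```

The second set for the same translator silently replaces the first value, which is the single-slot-per-translator behavior that makes sharing a slot within a complex translator risky.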

 

It's important to keep in mind that an fd_t or inode_t has only a limited number of these context slots, and only one may be used by each translator.  Therefore, there are many subtle opportunities for the table to overflow, or for code in one part of a complex translator to overwrite the values needed by another [5], etc.  There also seems to be a bit of a bug in the implementation of these functions, in that they use a loop index bounded by xlator->ctx->xl_count as an index into inode->_ctx even though the bound could exceed the array size.  I think this works because xl_count never happens to be set that high and these features aren't used everywhere, but it still seems very brittle (TBD).

 

1.4   Calling Conventions

 

STACK_WIND and STACK_UNWIND both enforce a certain consistency in the calling conventions for translator functions, though some variation is also possible.  Initial functions (e.g. foo_writev) all take the same first two arguments:

 

- frame is a pointer to the current call frame.

- this is a pointer to the current translator.  It's not clear how this could ever be anything other than frame->this (TBD).

 

For completion functions (e.g. foo_writev_cbk) the convention is a bit different, with four common parameters instead of two.

 

- frame is the same as for the initial function, pointing to the frame that's being resumed.

- cookie is usually the same if STACK_WIND was used, though it could be different if the STACK_WIND_COOKIE variant was used.  It's extracted from the frame and passed on by STACK_UNWIND, and might be another place to store some request-specific data.

- this is also the same as for the initial function, pointing to the translator whose frame is being resumed.

- op_ret is the return value from the nested call.  There is also op_errno, presumably mirroring the usage of return values and errno for system or libc calls.

 

Each individual function might also take additional parameters, varying in each case, so check xlator.h for specific details.
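
The shape of the two conventions can be shown with a pair of toy functions.  The types toy_frame, toy_xlator and toy_result are hypothetical, the operation-specific arguments (buf, len) are invented for the example rather than copied from xlator.h, and the nested call is replaced by a direct call to the completion function with a made-up 1024-byte limit to produce a failure case.  Passing the frame itself as cookie is also just an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_xlator { const char *name; };
struct toy_frame  { struct toy_xlator *this; void *local; };
struct toy_result { int op_ret; int op_errno; };

/* Completion function: frame, cookie, this, op_ret, then op_errno. */
static int foo_writev_cbk(struct toy_frame *frame, void *cookie,
                          struct toy_xlator *this, int op_ret, int op_errno)
{
    struct toy_result *res = frame->local;

    (void)cookie;                 /* unused in this sketch */
    assert(this == frame->this);  /* same translator as the initial half */
    res->op_ret = op_ret;
    res->op_errno = (op_ret < 0) ? op_errno : 0;  /* errno only on failure */
    return 0;                     /* completion functions return zero */
}

/* Initial function: frame and this first, then per-operation arguments. */
static int foo_writev(struct toy_frame *frame, struct toy_xlator *this,
                      const char *buf, size_t len)
{
    (void)buf;
    /* Stand-in for a nested call: pretend the subvolume holds 1024 bytes. */
    if (len > 1024)
        return foo_writev_cbk(frame, frame, this, -1, ENOSPC);
    return foo_writev_cbk(frame, frame, this, (int)len, 0);
}

/* Issue one write of the given length; return the op_errno reported. */
static int toy_convention_demo(size_t len)
{
    struct toy_xlator xl = { "foo" };
    struct toy_result res = { 0, 0 };
    struct toy_frame frame = { &xl, &res };

    foo_writev(&frame, &xl, "data", len);
    return res.op_errno;
}
```

As with system calls, the sketch treats op_errno as meaningful only when op_ret signals failure.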

 

 

[1] http://fuse.sourceforge.net/

 

[2] http://www.eecs.harvard.edu/~mdw/proj/seda/

 

[3] Yes, it's confusing to put a pointer to the caller's function in the callee's frame, but that's the way it is.

 

[4] Yes, the delete function fetches the value first, making it more of a destructive fetch than a pure delete.

 

[5] Note that the key is a translator instance rather than a translator type, so at least multiple instances of the same translator won't step on each other.