How redis-server Receives the First Command from a Client
The first piece of data redis-cli sends to redis-server is *1\r\n$7\r\nCOMMAND\r\n. Let's see how this data is handled: single-step through readQueryFromClient until the read call has collected the data, then follow the code that goes on to process c->querybuf. Tracing this in the debugger shows that the function called is processInputBuffer, located in networking.c:
/* This function is called every time, in the client structure 'c', there is
* more query buffer to process, because we read more data from the socket
* or because a client was blocked and later reactivated, so there could be
* pending query buffer, already representing a full command, to process. */
void processInputBuffer(client *c) {
server.current_client = c;
/* Keep processing while there is something in the input buffer */
while(sdslen(c->querybuf)) {
/* Return if clients are paused. */
if (!(c->flags & CLIENT_SLAVE) && clientsArePaused()) break;
/* Immediately abort if the client is in the middle of something. */
if (c->flags & CLIENT_BLOCKED) break;
/* CLIENT_CLOSE_AFTER_REPLY closes the connection once the reply is
* written to the client. Make sure to not let the reply grow after
* this flag has been set (i.e. don't process more commands).
*
* The same applies for clients we want to terminate ASAP. */
if (c->flags & (CLIENT_CLOSE_AFTER_REPLY|CLIENT_CLOSE_ASAP)) break;
/* Determine request type when unknown. */
if (!c->reqtype) {
if (c->querybuf[0] == '*') {
c->reqtype = PROTO_REQ_MULTIBULK;
} else {
c->reqtype = PROTO_REQ_INLINE;
}
}
if (c->reqtype == PROTO_REQ_INLINE) {
if (processInlineBuffer(c) != C_OK) break;
} else if (c->reqtype == PROTO_REQ_MULTIBULK) {
if (processMultibulkBuffer(c) != C_OK) break;
} else {
serverPanic("Unknown request type");
}
/* Multibulk processing could see a <= 0 length. */
if (c->argc == 0) {
resetClient(c);
} else {
/* Only reset the client when the command was executed. */
if (processCommand(c) == C_OK) {
if (c->flags & CLIENT_MASTER && !(c->flags & CLIENT_MULTI)) {
/* Update the applied replication offset of our master. */
c->reploff = c->read_reploff - sdslen(c->querybuf);
}
/* Don't reset the client structure for clients blocked in a
* module blocking command, so that the reply callback will
* still be able to access the client argv and argc field.
* The client will be reset in unblockClientFromModule(). */
if (!(c->flags & CLIENT_BLOCKED) || c->btype != BLOCKED_MODULE)
resetClient(c);
}
/* freeMemoryIfNeeded may flush slave output buffers. This may
* result into a slave, that may be the active client, to be
* freed. */
if (server.current_client == NULL) break;
}
}
server.current_client = NULL;
}
processInputBuffer first checks whether the received string starts with an asterisk (*). Here it does, so the client object's reqtype field is set to PROTO_REQ_MULTIBULK and processMultibulkBuffer is called to process the rest of the string. The parsed result is recorded as a redis command in the client object's argc and argv fields: the former holds the number of arguments making up the current command, and the latter is an array of pointers to the objects holding those arguments, with argv[0] being the command name itself. The commands themselves are not the focus of this course, so we will not go into them further; the standalone sketch below just shows what this parsing step yields for our example request.
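To make the result concrete, here is a minimal standalone sketch, not taken from the Redis sources (processMultibulkBuffer additionally handles partially received data, enforces protocol limits, and builds robj objects rather than plain C strings), that splits our example request into an argc/argv pair:

/* Minimal standalone sketch (not Redis code): split a complete RESP
 * multibulk request such as "*1\r\n$7\r\nCOMMAND\r\n" into argc/argv. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int parse_multibulk(const char *buf, char *argv[], int maxargs) {
    if (*buf != '*') return -1;                  /* inline requests not handled here */
    int argc = atoi(buf + 1);                    /* number of bulk strings, e.g. 1 */
    const char *p = strstr(buf, "\r\n");
    if (p == NULL || argc > maxargs) return -1;
    p += 2;
    for (int i = 0; i < argc; i++) {
        if (*p != '$') return -1;
        int len = atoi(p + 1);                   /* length of this bulk string, e.g. 7 */
        p = strstr(p, "\r\n");
        if (p == NULL) return -1;
        p += 2;                                  /* p now points at the payload */
        argv[i] = malloc((size_t)len + 1);
        memcpy(argv[i], p, (size_t)len);
        argv[i][len] = '\0';                     /* "COMMAND" */
        p += len + 2;                            /* skip payload plus trailing \r\n */
    }
    return argc;
}

int main(void) {
    char *argv[8];
    int argc = parse_multibulk("*1\r\n$7\r\nCOMMAND\r\n", argv, 8);
    printf("argc = %d, argv[0] = %s\n", argc, argv[0]);   /* argc = 1, argv[0] = COMMAND */
    return 0;
}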
Once the command has been parsed and processMultibulkBuffer returns, the command recorded in the client object's argv field is handled in processCommand.
//Indentation left unchanged to stay consistent with the original code
if (c->argc == 0) {
resetClient(c);
} else {
/* Only reset the client when the command was executed. */
if (processCommand(c) == C_OK) {
//code omitted
}
}
processCommand handles the command roughly as follows:
(1) It first checks whether this is the quit command. If so, a reply is appended to the send buffer (the reply to the redis client) and the CLIENT_CLOSE_AFTER_REPLY flag is set on the current client object; as the name suggests, the connection is closed once the reply has been sent.
(2) If it is not the quit command, lookupCommand is called to look the command up in the global command table. If that fails, an error reply is appended to the send buffer; an error here does not mean a bug in the program logic, it may simply be an invalid command sent by the client. If the command is found, it is executed and the reply is appended.
int processCommand(client *c) {
/* The QUIT command is handled separately. Normal command procs will
* go through checking for replication and QUIT will cause trouble
* when FORCE_REPLICATION is enabled and would be implemented in
* a regular command proc. */
if (!strcasecmp(c->argv[0]->ptr,"quit")) {
addReply(c,shared.ok);
c->flags |= CLIENT_CLOSE_AFTER_REPLY;
return C_ERR;
}
/* Now lookup the command and check ASAP about trivial error conditions
* such as wrong arity, bad command name and so forth. */
c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
if (!c->cmd) {
flagTransaction(c);
addReplyErrorFormat(c,"unknown command '%s'",
(char*)c->argv[0]->ptr);
return C_OK;
} else if ((c->cmd->arity > 0 && c->cmd->arity != c->argc) ||
(c->argc < -c->cmd->arity)) {
flagTransaction(c);
addReplyErrorFormat(c,"wrong number of arguments for '%s' command",
c->cmd->name);
return C_OK;
}
//... code omitted
}
The global command table is the commands field of the global server variable (of type redisServer) introduced earlier.
struct redisServer {
/* General */
pid_t pid; /* Main process pid. */
//unrelated fields omitted
dict *commands; /* Command table */
//unrelated fields omitted
}
Where this global command table is initialized, and the data structures behind it, are unrelated to the topic of this course, so we will not analyze them here; the short sketch below only illustrates the name-to-handler lookup that lookupCommand performs against this table.
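Purely for illustration, the sketch below shows the idea: resolve a command name to a handler and its arity. The sketchCommand type, the handler functions, and lookupCommandSketch are made up for this example; the real server.commands is a hash table of redisCommand entries, not a linear array.

/* Sketch only: the idea behind lookupCommand, i.e. resolving a command
 * name to a handler and its arity. Redis stores real redisCommand entries
 * in the server.commands hash table; a tiny array is enough here. */
#include <stdio.h>
#include <string.h>
#include <strings.h>

typedef struct sketchCommand {
    const char *name;
    void (*proc)(void);      /* stands in for the real command handler */
    int arity;               /* expected argc; negative means "at least -arity" */
} sketchCommand;

void commandProc(void) { puts("would build the reply for COMMAND"); }
void pingProc(void)    { puts("would reply +PONG"); }

sketchCommand sketchTable[] = {
    { "command", commandProc,  0 },
    { "ping",    pingProc,    -1 },
};

sketchCommand *lookupCommandSketch(const char *name) {
    for (size_t i = 0; i < sizeof(sketchTable) / sizeof(sketchTable[0]); i++)
        if (strcasecmp(sketchTable[i].name, name) == 0) return &sketchTable[i];
    return NULL;             /* unknown command: caller replies with an error */
}

int main(void) {
    sketchCommand *cmd = lookupCommandSketch("COMMAND");
    if (cmd != NULL) cmd->proc();
    return 0;
}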
Next, let's focus on how reply data (including error replies) is appended to the send buffer, taking the "ok" reply (shared.ok, which is the RESP string +OK\r\n on the wire) as an example:
void addReply(client *c, robj *obj) {
if (prepareClientToWrite(c) != C_OK) return;
/* This is an important place where we can avoid copy-on-write
* when there is a saving child running, avoiding touching the
* refcount field of the object if it's not needed.
*
* If the encoding is RAW and there is room in the static buffer
* we'll be able to send the object to the client without
* messing with its page. */
if (sdsEncodedObject(obj)) {
if (_addReplyToBuffer(c,obj->ptr,sdslen(obj->ptr)) != C_OK)
_addReplyObjectToList(c,obj);
} else if (obj->encoding == OBJ_ENCODING_INT) {
/* Optimization: if there is room in the static buffer for 32 bytes
* (more than the max chars a 64 bit integer can take as string) we
* avoid decoding the object and go for the lower level approach. */
if (listLength(c->reply) == 0 && (sizeof(c->buf) - c->bufpos) >= 32) {
char buf[32];
int len;
len = ll2string(buf,sizeof(buf),(long)obj->ptr);
if (_addReplyToBuffer(c,buf,len) == C_OK)
return;
/* else... continue with the normal code path, but should never
* happen actually since we verified there is room. */
}
obj = getDecodedObject(obj);
if (_addReplyToBuffer(c,obj->ptr,sdslen(obj->ptr)) != C_OK)
_addReplyObjectToList(c,obj);
decrRefCount(obj);
} else {
serverPanic("Wrong obj->encoding in addReply()");
}
}
There are two key calls in addReply: prepareClientToWrite and _addReplyToBuffer. Let's look at prepareClientToWrite first; it contains the following code:
if (!clientHasPendingReplies(c) &&
!(c->flags & CLIENT_PENDING_WRITE) &&
(c->replstate == REPL_STATE_NONE ||
(c->replstate == SLAVE_STATE_ONLINE && !c->repl_put_online_on_ack)))
{
/* Here instead of installing the write handler, we just flag the
* client and put it into a list of clients that have something
* to write to the socket. This way before re-entering the event
* loop, we can try to directly write to the client sockets avoiding
* a system call. We'll only really install the write handler if
* we'll not be able to write the whole reply at once. */
c->flags |= CLIENT_PENDING_WRITE;
listAddNodeHead(server.clients_pending_write,c);
}
This code first checks whether there are still unsent replies in the send buffer, by testing whether the client object's bufpos field (an int) or the length of its reply field (a linked list) is greater than 0.
/* Return true if the specified client has pending reply buffers to write to
* the socket. */
int clientHasPendingReplies(client *c) {
return c->bufpos || listLength(c->reply);
}
If the current client object does not already have the CLIENT_PENDING_WRITE flag set and there is no data left over in its send buffer, the CLIENT_PENDING_WRITE flag is set on it and the client object is added to the clients_pending_write list of the global server object. This list holds every client object that has data waiting to be sent; be careful not to confuse it with the reply list mentioned above.
Redis's own comment on the CLIENT_PENDING_WRITE flag reads:
send but a write handler is yet not installed
In other words: a client object that has data to send but for which no writable-event handler has been installed yet.
Next let's look at the _addReplyToBuffer function, located in networking.c.
int _addReplyToBuffer(client *c, const char *s, size_t len) {
size_t available = sizeof(c->buf)-c->bufpos;
if (c->flags & CLIENT_CLOSE_AFTER_REPLY) return C_OK;
/* If there already are entries in the reply list, we cannot
* add anything more to the static buffer. */
if (listLength(c->reply) > 0) return C_ERR;
/* Check that the buffer has enough space available for this string. */
if (len > available) return C_ERR;
memcpy(c->buf+c->bufpos,s,len);
c->bufpos+=len;
return C_OK;
}
This function again makes sure that the client object's reply list is empty (the if check; if the condition is not satisfied, the function returns an error). The reply list holds reply data waiting to be sent. The reply itself is stored in the client object's buf field, with its current length recorded in the bufpos field. buf is a fixed-size byte array:
typedef struct client {
uint64_t id; /* Client incremental unique ID. */
int fd; /* Client socket. */
redisDb *db; /* Pointer to currently SELECTed DB. */
robj *name; /* As set by CLIENT SETNAME. */
sds querybuf; /* Buffer we use to accumulate client queries. */
sds pending_querybuf; /* If this is a master, this buffer represents the
yet not applied replication stream that we
are receiving from the master. */
//some fields omitted...
/* Response buffer */
int bufpos;
char buf[PROTO_REPLY_CHUNK_BYTES];
} client;
PROTO_REPLY_CHUNK_BYTES is defined in redis as 16*1024, so this static reply buffer holds at most 16 KB; a reply that does not fit here is appended to the reply list instead (the _addReplyObjectToList branch seen in addReply above).
Back to the request we started with, *1\r\n$7\r\nCOMMAND\r\n: after lookupCommand resolves it we get the command command, which looks like this in GDB:
2345 c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
(gdb) n
2346 if (!c->cmd) {
(gdb) p c->cmd
$23 = (struct redisCommand *) 0x742db0 <redisCommandTable+13040>
(gdb) p *c->cmd
$24 = {name = 0x4fda67 "command", proc = 0x42d920 <commandCommand>, arity = 0, sflags = 0x50dc3e "lt", flags = 1536, getkeys_proc = 0x0, firstkey = 0, lastkey = 0,
keystep = 0, microseconds = 1088, calls = 1}
How Writable Events Are Handled
We have seen how redis-server handles readable events: register a read-event callback; in that callback call the operating system's read function to collect the data, parse the data into a redis command, process the command, and place the reply data into the client object's buf field. So when is the data placed in buf actually sent to the client?
Remember the while event loop covered in an earlier lesson? Let's review its code:
void aeMain(aeEventLoop *eventLoop) {
eventLoop->stop = 0;
while (!eventLoop->stop) {
if (eventLoop->beforesleep != NULL)
eventLoop->beforesleep(eventLoop);
aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
}
}
The loop first checks whether the eventLoop object's beforesleep member is set. This is a callback function, and it was already set during redis-server initialization.
void aeSetBeforeSleepProc(aeEventLoop *eventLoop, aeBeforeSleepProc *beforesleep) {
eventLoop->beforesleep = beforesleep;
}
Let's set a breakpoint on aeSetBeforeSleepProc and restart redis-server to verify where this callback gets installed.
Breakpoint 2, aeSetBeforeSleepProc (eventLoop=0x7ffff083a0a0, beforesleep=beforesleep@entry=0x4294f0 <beforeSleep>) at ae.c:507
507 eventLoop->beforesleep = beforesleep;
(gdb) bt
#0 aeSetBeforeSleepProc (eventLoop=0x7ffff083a0a0, beforesleep=beforesleep@entry=0x4294f0 <beforeSleep>) at ae.c:507
#1 0x00000000004238d2 in main (argc=<optimized out>, argv=0x7fffffffe588) at server.c:3892
Use the f 1 command to switch to frame #1, then enter l to list the code around that point:
(gdb) l
3887 /* Warning the user about suspicious maxmemory setting. */
3888 if (server.maxmemory > 0 && server.maxmemory < 1024*1024) {
3889 serverLog(LL_WARNING,"WARNING: You specified a maxmemory value that is less than 1MB (current value is %llu bytes). Are you sure this is what you really want?", server.maxmemory);
3890 }
3891
3892 aeSetBeforeSleepProc(server.el,beforeSleep);
3893 aeSetAfterSleepProc(server.el,afterSleep);
3894 aeMain(server.el);
3895 aeDeleteEventLoop(server.el);
3896 return 0;
Line 3892 sets this callback to the beforeSleep function, so beforeSleep is called on every iteration of the loop. server.el was introduced earlier as well; it is the aeEventLoop object. Inside this beforeSleep function (located in server.c) there is a call to handleClientsWithPendingWrites:
void beforeSleep(struct aeEventLoop *eventLoop) {
//unrelated code omitted...
/* Handle writes with pending output buffers. */
handleClientsWithPendingWrites();
//unrelated code omitted...
}
The handleClientsWithPendingWrites call is what sends out the data recorded in each client. Let's look at the sending logic in detail (located in networking.c):
/* This function is called just before entering the event loop, in the hope
* we can just write the replies to the client output buffer without any
* need to use a syscall in order to install the writable event handler,
* get it called, and so forth. */
int handleClientsWithPendingWrites(void) {
listIter li;
listNode *ln;
int processed = listLength(server.clients_pending_write);
listRewind(server.clients_pending_write,&li);
while((ln = listNext(&li))) {
client *c = listNodeValue(ln);
c->flags &= ~CLIENT_PENDING_WRITE;
listDelNode(server.clients_pending_write,ln);
/* Try to write buffers to the client socket. */
if (writeToClient(c->fd,c,0) == C_ERR) continue;
/* If there is nothing left, do nothing. Otherwise install
* the write handler. */
if (clientHasPendingReplies(c) &&
aeCreateFileEvent(server.el, c->fd, AE_WRITABLE,
sendReplyToClient, c) == AE_ERR)
{
freeClientAsync(c);
}
}
return processed;
}
The code above takes the client objects that have data to send, one by one, from the clients_pending_write field of the global server object (a linked list of client objects), and calls writeToClient to try to send out the reply data stored in each client.
//located in networking.c
int writeToClient(int fd, client *c, int handler_installed) {
ssize_t nwritten = 0, totwritten = 0;
size_t objlen;
sds o;
while(clientHasPendingReplies(c)) {
if (c->bufpos > 0) {
nwritten = write(fd,c->buf+c->sentlen,c->bufpos-c->sentlen);
if (nwritten <= 0) break;
c->sentlen += nwritten;
totwritten += nwritten;
/* If the buffer was sent, set bufpos to zero to continue with
* the remainder of the reply. */
if ((int)c->sentlen == c->bufpos) {
c->bufpos = 0;
c->sentlen = 0;
}
} else {
o = listNodeValue(listFirst(c->reply));
objlen = sdslen(o);
if (objlen == 0) {
listDelNode(c->reply,listFirst(c->reply));
continue;
}
nwritten = write(fd, o + c->sentlen, objlen - c->sentlen);
if (nwritten <= 0) break;
c->sentlen += nwritten;
totwritten += nwritten;
/* If we fully sent the object on head go to the next one */
if (c->sentlen == objlen) {
listDelNode(c->reply,listFirst(c->reply));
c->sentlen = 0;
c->reply_bytes -= objlen;
/* If there are no longer objects in the list, we expect
* the count of reply bytes to be exactly zero. */
if (listLength(c->reply) == 0)
serverAssert(c->reply_bytes == 0);
}
}
/* Note that we avoid to send more than NET_MAX_WRITES_PER_EVENT
* bytes, in a single threaded server it's a good idea to serve
* other clients as well, even if a very large request comes from
* super fast link that is always able to accept data (in real world
* scenario think about 'KEYS *' against the loopback interface).
*
* However if we are over the maxmemory limit we ignore that and
* just deliver as much data as it is possible to deliver. */
if (totwritten > NET_MAX_WRITES_PER_EVENT &&
(server.maxmemory == 0 ||
zmalloc_used_memory() < server.maxmemory)) break;
}
server.stat_net_output_bytes += totwritten;
if (nwritten == -1) {
if (errno == EAGAIN) {
nwritten = 0;
} else {
serverLog(LL_VERBOSE,
"Error writing to client: %s", strerror(errno));
freeClient(c);
return C_ERR;
}
}
if (totwritten > 0) {
/* For clients representing masters we don't count sending data
* as an interaction, since we always send REPLCONF ACK commands
* that take some time to just fill the socket output buffer.
* We just rely on data / pings received for timeout detection. */
if (!(c->flags & CLIENT_MASTER)) c->lastinteraction = server.unixtime;
}
if (!clientHasPendingReplies(c)) {
c->sentlen = 0;
if (handler_installed) aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);
/* Close connection after entire reply has been sent. */
if (c->flags & CLIENT_CLOSE_AFTER_REPLY) {
freeClient(c);
return C_ERR;
}
}
return C_OK;
}
writeToClient first sends out the data accumulated in the buf field of the client object it is handling (and then whatever is queued on the reply list); if a real write error occurs, the client is freed. If all the data is sent, the writable event on the corresponding fd is removed (if one had been registered), and if the client has the CLIENT_CLOSE_AFTER_REPLY flag set, the client object is freed as soon as the data has been sent.
Of course, it can happen that, because of the network or the client, redis-server cannot send some client's data at all, or can only send part of it (for example, the server keeps writing data to a client whose application layer never drains its TCP kernel receive buffer; once the peer's buffer fills up, the server can no longer send, and since the fd is non-blocking the send or write call returns immediately with -1 and errno set to EAGAIN, which is exactly the case handled in the code above). Either way, the data cannot be sent in full this time, and that is when a writable event has to be watched, because handleClientsWithPendingWrites contains the following code:
/* If there is nothing left, do nothing. Otherwise install
* the write handler. */
if (clientHasPendingReplies(c) && aeCreateFileEvent(server.el, c->fd, AE_WRITABLE,
sendReplyToClient, c) == AE_ERR)
{
freeClientAsync(c);
}
The callback registered here for the AE_WRITABLE writable event is sendReplyToClient. In other words, the next time a writable event fires on this fd, sendReplyToClient is the function that gets called. One would guess that sendReplyToClient sends data with exactly the same logic as writeToClient above, and indeed it does (located in networking.c):
/* Write event handler. Just send data to the client. */
void sendReplyToClient(aeEventLoop *el, int fd, void *privdata, int mask) {
UNUSED(el);
UNUSED(mask);
writeToClient(fd,privdata,1);
}
With that, the logic redis-server uses to send data is clear as well. A brief summary:
When there is data to send to a client, there is no need to register a writable event first and wait for it to fire. The usual approach is to send directly at the point where the reply is produced; if the data cannot all be sent because the peer's TCP window is too small, store the remainder in a buffer and register a writable event, try again each time the writable event fires, and remove the writable event once everything has been sent.
redis-server's send logic differs only slightly from this: the actual sending is moved to a fixed point in the event loop (here, just before aeProcessEvents is called); everything else is exactly as described above.
The reason for not registering a writable event up front and waiting for it to fire before sending is that, under normal conditions, both ends of a connection send and receive data without problems, and it is rare for one end to be unable to send because the other end's TCP window is too small. If the writable event were registered permanently, it would fire on almost every loop iteration while there is often nothing to send, wasting system resources and the server's precious CPU time. The standalone sketch below illustrates this direct-write strategy outside of Redis.
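The following is a minimal sketch of that strategy, written directly against epoll rather than redis's ae event-loop wrapper. The conn structure and the conn_reply/conn_flush functions are invented for this example; the fd is assumed to be non-blocking and already registered with the epoll instance for EPOLLIN, epoll_ctl error handling is omitted, and the fragment is meant to be dropped into an existing epoll loop rather than run on its own.

/* Minimal sketch (not Redis code) of the direct-write strategy: try to write
 * a reply immediately; only if the socket cannot take all of it (partial
 * write or EAGAIN) buffer the rest and register EPOLLOUT, then drop EPOLLOUT
 * again once the buffer has drained. */
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

typedef struct conn {
    int fd;                  /* non-blocking client socket */
    char out[16 * 1024];     /* pending output, analogous to client->buf */
    size_t outlen;           /* bytes queued in out */
    size_t sentlen;          /* bytes of out already written, like client->sentlen */
    int want_write;          /* 1 while EPOLLOUT is registered */
} conn;

/* Flush as much of the pending buffer as the kernel will accept.
 * Returns -1 on a fatal error (caller should close the connection). */
int conn_flush(int epfd, conn *c) {
    while (c->sentlen < c->outlen) {
        ssize_t n = write(c->fd, c->out + c->sentlen, c->outlen - c->sentlen);
        if (n > 0) { c->sentlen += (size_t)n; continue; }
        if (n == -1 && errno == EINTR) continue;
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;  /* send buffer full */
        return -1;
    }
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.data.ptr = c;
    if (c->sentlen < c->outlen) {
        if (!c->want_write) {            /* leftovers: start watching for writability */
            ev.events = EPOLLIN | EPOLLOUT;
            epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
            c->want_write = 1;
        }
    } else {
        c->outlen = c->sentlen = 0;      /* everything sent: stop watching EPOLLOUT */
        if (c->want_write) {
            ev.events = EPOLLIN;
            epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
            c->want_write = 0;
        }
    }
    return 0;
}

/* Called wherever a reply is produced: queue it and try to send right away,
 * so that in the common case no writable event is ever registered. */
int conn_reply(int epfd, conn *c, const char *data, size_t len) {
    if (c->outlen + len > sizeof(c->out)) return -1;  /* sketch: no overflow list */
    memcpy(c->out + c->outlen, data, len);
    c->outlen += len;
    return conn_flush(epfd, c);
}

When the event loop later reports EPOLLOUT for such a connection, its handler only needs to call conn_flush again, which mirrors the way sendReplyToClient simply delegates to writeToClient.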