The source code discussed here is Redis 6.0.5.

The stream type is Redis's fairly complete implementation of a message queue.

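The figure below shows the main data structures of a stream. The items numbered 1 to 20 are the log entries. In a stream these entries are not stored in a sequential array; they are stored in a rax, a radix tree.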

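Besides the log entries, a stream also maintains a set of consumer groups, which are likewise stored in a rax. Each consumer group mainly holds last_id, the position up to which the group has consumed; pel, the messages that have been delivered to a consumer but not yet acknowledged; and consumers, all the consumers in the group.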

[Figure: the main data structures of a stream]

The stream data structure

The stream struct is shown below. It consists mainly of the following four fields:

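  • rax: the radix tree that stores the log entries of the stream. The key is the entry ID and the value is a listpack holding the messages
  • length: the number of elements in the stream
  • last_id: the ID of the last entry added to the stream
  • cgroups: the radix tree that stores all consumer groups of the stream. It can be viewed as a dict whose key is the group name and whose value is a streamCG struct holding the data of that group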
typedef struct stream {
    rax *rax;               /* The radix tree holding the stream. */
    uint64_t length;        /* Number of elements inside this stream. */
    streamID last_id;       /* Zero if there are yet no items. */
    rax *cgroups;           /* Consumer groups dictionary: name -> streamCG */
} stream;
The rax data structure

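The figure below shows how the log entries are stored in a rax (radix tree). The key is the entry ID, an [ms, seq] pair of two 64-bit numbers combined into a single 128-bit number, which is stored in the rax as 16 bytes. The value attached to each key is a listpack, and one listpack holds multiple log entries.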

In the figure the rax stores data for two IDs:

000001797D997BA30000000000000000 and 000001797D997BA30000000000000001. The key nodes for both ultimately point to a listpack.
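To make the key layout concrete, here is a minimal sketch of turning an [ms, seq] pair into a 16-byte big-endian key. The names sketch_id, sketch_encode_id and sketch_key_to_hex are made up for illustration and are not the Redis functions; the sketch only assumes what the text and the source comments say, namely that the ID is treated as one 128-bit big-endian number so that byte-wise comparison matches numeric order.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the two 64-bit halves of a stream ID. */
typedef struct {
    uint64_t ms;   /* milliseconds part */
    uint64_t seq;  /* sequence part */
} sketch_id;

/* Write v into out[0..7], most significant byte first. */
static void put_be64(unsigned char *out, uint64_t v) {
    for (int i = 7; i >= 0; i--) {
        out[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Build the 16-byte key: big-endian ms followed by big-endian seq. */
void sketch_encode_id(unsigned char key[16], const sketch_id *id) {
    put_be64(key, id->ms);
    put_be64(key + 8, id->seq);
}

/* Render the key as 32 uppercase hex digits plus a terminating NUL. */
void sketch_key_to_hex(const unsigned char key[16], char hex[33]) {
    for (int i = 0; i < 16; i++)
        snprintf(hex + i * 2, 3, "%02X", (unsigned)key[i]);
}
```

With ms = 0x1797D997BA3 and seq = 0 the hex form is 000001797D997BA30000000000000000, the first key mentioned above, and memcmp on the two keys orders seq 0 before seq 1.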

[Figure: log entries stored in the rax, with the keys pointing to listpacks]

The listpack structure

The layout of the listpack used by a stream is described below.

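Note that every element in a listpack consists of two parts: the first part is the encoded data of the element, and the second part is the number of bytes the element occupies. That length is stored in a variable number of bytes and is read from right to left, with the highest bit of each byte indicating whether more length bytes precede it. If the highest bit of the current byte is 1, there are more bytes before it; if it is 0, there are none.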
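The length encoding described above can be sketched as follows. This is an illustration written for this article, not the listpack implementation itself: the function names are hypothetical, and the only assumptions are the ones stated in the paragraph (7 payload bits per byte, high bit meaning "more bytes to the left").

```c
#include <stddef.h>
#include <stdint.h>

/* Encode len using 7 payload bits per byte. The leftmost byte has its
 * high bit clear; every byte to its right has the high bit set, meaning
 * "there are more bytes to my left". buf must have room for 10 bytes.
 * Returns the number of bytes written. */
size_t sketch_encode_backlen(unsigned char *buf, uint64_t len) {
    size_t n = 1;
    for (uint64_t t = len >> 7; t != 0; t >>= 7) n++;  /* bytes needed */
    for (size_t i = 0; i < n; i++) {
        unsigned char b =
            (unsigned char)((len >> (7 * (n - 1 - i))) & 127);
        if (i != 0) b |= 128;  /* not the leftmost byte */
        buf[i] = b;
    }
    return n;
}

/* Decode starting from the LAST byte of the length and moving left,
 * which is how a right-to-left traversal would meet it. */
uint64_t sketch_decode_backlen(const unsigned char *last) {
    uint64_t val = 0;
    unsigned shift = 0;
    for (;;) {
        val |= (uint64_t)(*last & 127) << shift;
        if (!(*last & 128)) break;  /* high bit 0: no more bytes */
        shift += 7;
        last--;
    }
    return val;
}
```

For example, a length of 300 takes two bytes, 2 followed by 172 (44 with the high bit set), and decoding from the second byte yields 300 again.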

The individual fields have the following meaning:

  • count: the number of valid entries in this listpack
  • deleted: the number of invalid (deleted) entries in this listpack

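The rest of the listpack is divided into two parts.

The first part, shown in red in the figure below, holds the master fields of the listpack. These are the fields of the first entry inserted into the listpack. When a later entry is inserted and its fields are exactly the same as the master fields, only its values need to be stored, not its fields. The yellow part of the figure stores only values and no fields.

The second part, shown in yellow and green, holds the log entries one after another. If the fields of an entry match the red master fields exactly, the STREAM_ITEM_FLAG_SAMEFIELDS flag is set in its flags and its data area stores only the values, not the fields.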

flags stores the flags of an entry, such as STREAM_ITEM_FLAG_SAMEFIELDS above. When an entry is deleted, STREAM_ITEM_FLAG_DELETED is added to its flags.

Every entry also carries an entry-id. It does not store the full ID of the entry, only the offset from the key under which this listpack is stored in the rax (the master ID).
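As a small illustration of the offset idea, the sketch below converts between a full ID and its offset from the master ID. The types and function names are hypothetical; the arithmetic follows the id.ms - master_id.ms and id.seq - master_id.seq expressions that appear in streamAppendItem further down.

```c
#include <stdint.h>

/* Hypothetical types for this sketch only. */
typedef struct { uint64_t ms; uint64_t seq; } entry_id;
typedef struct { int64_t ms_diff; int64_t seq_diff; } entry_delta;

/* What gets stored in the listpack: the difference to the master ID.
 * The seq difference can be negative when ms grew and seq restarted. */
entry_delta sketch_id_to_delta(entry_id master, entry_id id) {
    entry_delta d;
    d.ms_diff = (int64_t)(id.ms - master.ms);
    d.seq_diff = (int64_t)(id.seq - master.seq);
    return d;
}

/* What a reader does: add the stored differences back to the master ID. */
entry_id sketch_delta_to_id(entry_id master, entry_delta d) {
    entry_id id = master;
    id.ms += (uint64_t)d.ms_diff;
    id.seq += (uint64_t)d.seq_diff;
    return id;
}
```

With a master ID of 1000-5, the entry 1002-0 is stored as the pair (2, -5), which is much smaller than two full 64-bit numbers.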

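Finally, lp_count stores the number of listpack elements that make up the entry (flags, the two ID offsets, the values and, when present, the field count and field names). It is mainly used to traverse the listpack from right to left.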
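The role of lp_count in a right-to-left walk can be sketched with a flat array of integers standing in for the listpack elements. Everything here is a simplification for illustration: the function names are hypothetical, the flag value 2 for SAMEFIELDS is an assumption about the header file, and real listpack elements are variable-length rather than array slots.

```c
#include <stdint.h>

#define SKETCH_FLAG_SAMEFIELDS 2  /* assumed value, for this sketch only */

/* Number of elements an entry occupies, excluding lp_count itself. */
int64_t sketch_lp_count(int64_t numfields, int flags) {
    int64_t n = numfields + 3;      /* values + flags + ms-diff + seq-diff */
    if (!(flags & SKETCH_FLAG_SAMEFIELDS))
        n += numfields + 1;         /* field names + the num-fields element */
    return n;
}

/* elems is a flat array standing in for the listpack elements.
 * Given the index of an entry's lp_count element, return the index of
 * that entry's flags element, or -1 when the value is 0, which marks
 * the terminator of the master entry (nothing further to the left). */
long sketch_entry_start(const int64_t *elems, long lp_count_idx) {
    int64_t n = elems[lp_count_idx];
    if (n == 0) return -1;
    return lp_count_idx - (long)n;
}
```

Starting from the last element of the listpack, a reader jumps back lp_count elements to reach the flags of the last entry; the element just before those flags is the lp_count of the previous entry, and so on until the zero terminator of the master entry is reached.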

[Figure: layout of a stream listpack, with the master fields in red and the entries in yellow and green]

Source code analysis

streamAppendItem

This function adds a new log entry to the rax. It generates the new ID, finds the listpack stored under the last key of the rax, and appends the new entry to that listpack.
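The ID generation step is not shown in this article, so the following is only a sketch of the commonly documented rule for auto-generated IDs: use the current millisecond time when it is ahead of the last ID, otherwise keep the millisecond part and bump the sequence. The type and function names are hypothetical, now_ms stands in for the server clock, and sequence overflow is deliberately ignored.

```c
#include <stdint.h>

/* Hypothetical ID type for this sketch only. */
typedef struct { uint64_t ms; uint64_t seq; } next_id_t;

/* Produce an ID strictly greater than last.
 * now_ms is passed in so the sketch does not depend on a real clock. */
next_id_t sketch_next_id(next_id_t last, uint64_t now_ms) {
    next_id_t id;
    if (now_ms > last.ms) {
        id.ms = now_ms;   /* clock moved forward: restart the sequence */
        id.seq = 0;
    } else {
        id = last;        /* same or earlier millisecond: bump seq */
        id.seq++;         /* overflow of seq is ignored in this sketch */
    }
    return id;
}
```

This also shows why the function shown below rejects any ID that is not greater than last_id: IDs within a stream must be strictly increasing.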

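Appending to the listpack mainly consists of comparing the fields of the new entry with the master fields to decide whether the field names have to be stored, and then writing the data in the layout described above.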
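The field comparison can be sketched on plain C strings as below. This is a simplified illustration with a hypothetical function name; the real code compares sds strings against listpack elements, but the rule is the same: the count must match and every field name must match in the same position.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 when the new entry's field names are exactly the master
 * field names, in the same order, and 0 otherwise. */
int sketch_same_fields(const char **master, size_t master_count,
                       const char **fields, size_t count) {
    if (count != master_count) return 0;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(fields[i]);
        /* Stop at the first mismatch in length or content. */
        if (len != strlen(master[i]) ||
            memcmp(master[i], fields[i], len) != 0)
            return 0;
    }
    return 1;
}
```

Note that order matters: an entry with fields (age, name) does not match master fields (name, age), so its field names would be stored in full.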

int streamAppendItem(stream *s, robj **argv, int64_t numfields, streamID *added_id, streamID *use_id) {
    
    // Generate the new ID
    streamID id;
    if (use_id)
        id = *use_id;
    else
        streamNextID(&s->last_id,&id);
    if (streamCompareID(&id,&s->last_id) <= 0) return C_ERR;

    /* Add the new entry. */
    raxIterator ri;
    // Create the raxIterator
    raxStart(&ri,s->rax);
    // Move the iterator to the rightmost (last) node
    raxSeek(&ri,"$",NULL,0);

    size_t lp_bytes = 0;        /* Total bytes in the tail listpack. */
    unsigned char *lp = NULL;   /* Tail listpack pointer. */

    /* Get a reference to the tail node listpack. */
    // Fetch the listpack of the last node
    if (raxNext(&ri)) {
        lp = ri.data;
        lp_bytes = lpBytes(lp);
    }
    raxStop(&ri);

    /* We have to add the key into the radix tree in lexicographic order,
     * to do so we consider the ID as a single 128 bit number written in
     * big endian, so that the most significant bytes are the first ones. */
    uint64_t rax_key[2];    /* Key in the radix tree containing the listpack.*/
    streamID master_id;     /* ID of the master entry in the listpack. */

    // If a per-listpack limit on bytes or entries is configured,
    // check whether the tail listpack has already reached it
    if (lp != NULL) {
        if (server.stream_node_max_bytes &&
            lp_bytes >= server.stream_node_max_bytes)
        {
            lp = NULL;
        } else if (server.stream_node_max_entries) {
            int64_t count = lpGetInteger(lpFirst(lp));
            if (count >= server.stream_node_max_entries) lp = NULL;
        }
    }
    int flags = STREAM_ITEM_FLAG_NONE;
    // The tail node has reached its limit, or there is no listpack yet:
    // create a new listpack and insert it into the rax
    if (lp == NULL || lp_bytes >= server.stream_node_max_bytes) {
        master_id = id;
        streamEncodeID(rax_key,&id);
        /* Create the listpack having the master entry ID and fields. */
        lp = lpNew();
        lp = lpAppendInteger(lp,1); /* One item, the one we are adding. */
        lp = lpAppendInteger(lp,0); /* Zero deleted so far. */
        lp = lpAppendInteger(lp,numfields);
        // Append the master fields
        for (int64_t i = 0; i < numfields; i++) {
            sds field = argv[i*2]->ptr;
            lp = lpAppend(lp,(unsigned char*)field,sdslen(field));
        }
        lp = lpAppendInteger(lp,0); /* Master entry zero terminator. */
        // Insert the new listpack into the rax
        raxInsert(s->rax,(unsigned char*)&rax_key,sizeof(rax_key),lp,NULL);
        /* The first entry we insert, has obviously the same fields of the
         * master entry. */
        flags |= STREAM_ITEM_FLAG_SAMEFIELDS;
    } else {
        serverAssert(ri.key_len == sizeof(rax_key));
        memcpy(rax_key,ri.key,sizeof(rax_key));

        /* Read the master ID from the radix tree key. */
        streamDecodeID(rax_key,&master_id);
        unsigned char *lp_ele = lpFirst(lp);

        /* Update count and skip the deleted fields. */
        int64_t count = lpGetInteger(lp_ele);
        lp = lpReplaceInteger(lp,&lp_ele,count+1);
        lp_ele = lpNext(lp,lp_ele); /* seek deleted. */
        lp_ele = lpNext(lp,lp_ele); /* seek master entry num fields. */

        /* Check if the entry we are adding, have the same fields
         * as the master entry. */
        int64_t master_fields_count = lpGetInteger(lp_ele);
        lp_ele = lpNext(lp,lp_ele);
        // Check whether the fields of the new entry match the master fields exactly
        if (numfields == master_fields_count) {
            int64_t i;
            for (i = 0; i < master_fields_count; i++) {
                sds field = argv[i*2]->ptr;
                int64_t e_len;
                unsigned char buf[LP_INTBUF_SIZE];
                unsigned char *e = lpGet(lp_ele,&e_len,buf);
                /* Stop if there is a mismatch. */
                if (sdslen(field) != (size_t)e_len ||
                    memcmp(e,field,e_len) != 0) break;
                lp_ele = lpNext(lp,lp_ele);
            }
            /* All fields are the same! We can compress the field names
             * setting a single bit in the flags. */
            // All field names are the same
            if (i == master_fields_count) flags |= STREAM_ITEM_FLAG_SAMEFIELDS;
        }
    }
    // Append the flags, followed by the ID offsets relative to the master ID
    lp = lpAppendInteger(lp,flags);
    lp = lpAppendInteger(lp,id.ms - master_id.ms);
    lp = lpAppendInteger(lp,id.seq - master_id.seq);
    if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS))
        lp = lpAppendInteger(lp,numfields);
    for (int64_t i = 0; i < numfields; i++) {
        sds field = argv[i*2]->ptr, value = argv[i*2+1]->ptr;
        if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS))
            lp = lpAppend(lp,(unsigned char*)field,sdslen(field));
        lp = lpAppend(lp,(unsigned char*)value,sdslen(value));
    }
    /* Compute and store the lp-count field. */
    int64_t lp_count = numfields;
    lp_count += 3; /* Add the 3 fixed fields flags + ms-diff + seq-diff. */
    if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS)) {
        /* If the item is not compressed, it also has the fields other than
         * the values, and an additional num-fileds field. */
        lp_count += numfields+1;
    }
    lp = lpAppendInteger(lp,lp_count);
    // lp may have been reallocated while appending, which changes its address;
    // in that case update the pointer stored in the rax
    if (ri.data != lp)
        raxInsert(s->rax,(unsigned char*)&rax_key,sizeof(rax_key),lp,NULL);
    s->length++;
    s->last_id = id;
    if (added_id) *added_id = id;
    return C_OK;
}
streamDeleteItem

Deleting an entry from a stream is fairly simple: an iterator locates the entry with the given ID, and the entry is flagged as deleted in place.

int streamDeleteItem(stream *s, streamID *id) {
    int deleted = 0;
    streamIterator si;
    streamIteratorStart(&si,s,id,id,0);
    streamID myid;
    int64_t numfields;
    if (streamIteratorGetID(&si,&myid,&numfields)) {
        // The actual deletion happens here: the entry is flagged as deleted inside its listpack
        streamIteratorRemoveEntry(&si,&myid);
        deleted = 1;
    }
    streamIteratorStop(&si);
    return deleted;
}
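streamIteratorRemoveEntry itself is not listed in this article. The sketch below is a toy model of what "flagging as deleted" could look like, built only from the count and deleted header fields and the flags element described earlier. The names, the flag value 1, and the rule that a node holding a single entry can be dropped entirely are assumptions for illustration, not a transcription of the Redis function.

```c
#include <stdint.h>

#define SKETCH_FLAG_DELETED 1  /* assumed value, for this sketch only */

/* Toy model of the two counters at the head of a stream listpack. */
typedef struct {
    int64_t count;    /* entries still valid */
    int64_t deleted;  /* entries flagged as deleted */
} sketch_header;

/* Flag one entry as deleted and fix the counters.
 * Returns 1 when this was the only valid entry, in which case the
 * caller could drop the whole node (assumption), and 0 otherwise. */
int sketch_mark_deleted(sketch_header *h, int64_t *entry_flags) {
    *entry_flags |= SKETCH_FLAG_DELETED;
    if (h->count == 1) return 1;
    h->count--;
    h->deleted++;
    return 0;
}
```

The point of this design is that the entry's bytes stay in the listpack; readers simply skip entries whose flags carry the deleted bit, as the iterator code later in this article does.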
streamIterator

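The stream iterator shows up in many places in the stream code. Internally it holds a raxIterator, which iterates over the rax.

Iteration usually follows the pattern below. streamIteratorStart first fixes the position where iteration begins and the range to cover.
streamIteratorGetID finds the next entry, moving on to the next rax node when needed, and streamIteratorGetField walks the field-value pairs of that entry inside the node's listpack.
Finally, streamIteratorStop releases the resources held by the iterator.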

streamIterator myiterator;
streamID ID;
int64_t numfields;
streamIteratorStart(&myiterator,...);
while(streamIteratorGetID(&myiterator,&ID,&numfields)) {
    while(numfields--) {
        unsigned char *key, *value;
        /* int64_t matches the signature of streamIteratorGetField. */
        int64_t key_len, value_len;
        streamIteratorGetField(&myiterator,&key,&value,&key_len,&value_len);
        ...
    }
}
streamIteratorStop(&myiterator);
streamIteratorStart

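streamIteratorStart mainly fixes the range of the iterator and then calls raxSeek to position the rax iterator, so that raxNext (or raxPrev when iterating in reverse) can step through the listpacks.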

void streamIteratorStart(streamIterator *si, stream *s, streamID *start, streamID *end, int rev) {
    // Determine the start key of the iterator
    if (start) {
        streamEncodeID(si->start_key,start);
    } else {
        si->start_key[0] = 0;
        si->start_key[1] = 0;
    }
    // Determine the end key
    if (end) {
        streamEncodeID(si->end_key,end);
    } else {
        si->end_key[0] = UINT64_MAX;
        si->end_key[1] = UINT64_MAX;
    }
    // Start the rax iterator: raxStart initializes it, then raxSeek positions it at the first node to visit
    raxStart(&si->ri,s->rax);
    if (!rev) {
        // Forward iteration
        if (start && (start->ms || start->seq)) {
            // Seek to the last node whose key is <= start_key
            raxSeek(&si->ri,"<=",(unsigned char*)si->start_key,
                    sizeof(si->start_key));
            // No node with a key <= start_key exists: seek to the first node instead
            if (raxEOF(&si->ri)) raxSeek(&si->ri,"^",NULL,0);
        } else {
            raxSeek(&si->ri,"^",NULL,0);
        }
    } else {
        if (end && (end->ms || end->seq)) {
            raxSeek(&si->ri,"<=",(unsigned char*)si->end_key,
                    sizeof(si->end_key));
            if (raxEOF(&si->ri)) raxSeek(&si->ri,"$",NULL,0);
        } else {
            raxSeek(&si->ri,"$",NULL,0);
        }
    }
    si->stream = s;
    si->lp = NULL; /* There is no current listpack right now. */
    si->lp_ele = NULL; /* Current listpack cursor. */
    si->rev = rev;  /* Direction, if non-zero reversed, from end to start. */
}
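Why the forward case seeks with "<=" rather than ">=" becomes clearer with a toy model. A node's key is the ID of the first entry in its listpack, so the entry the caller wants to start from may sit in the middle of the node whose key is smaller than the start ID. The sketch below uses a sorted array of 64-bit keys instead of a rax, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* keys holds the node keys in ascending order (n must be > 0).
 * Return the index of the last key <= start. When every key is larger
 * than start, fall back to index 0, mirroring the seek to the first
 * node in the code above. */
size_t sketch_seek_start(const uint64_t *keys, size_t n, uint64_t start) {
    size_t lo = 0, hi = n;  /* binary search over the range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] <= start) lo = mid + 1;
        else hi = mid;
    }
    /* lo is now the number of keys <= start. */
    return lo == 0 ? 0 : lo - 1;
}
```

With node keys 100, 200 and 300 and a start of 250, the search lands on the node keyed 200, because an entry with ID 250 can only live in that node's listpack. Skipping the entries before 250 inside that node is then the job of streamIteratorGetID.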
streamIteratorGetID

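This function locates the current entry inside the listpack. A listpack can hold many entries, and the ID at which iteration should start may fall in the middle of a listpack, so streamIteratorGetID has to walk forward to the right position. Once a listpack has been fully iterated, the function also moves on to the next node of the rax.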

int streamIteratorGetID(streamIterator *si, streamID *id, int64_t *numfields) {
    while(1) { /* Will stop when element > stop_key or end of radix tree. */
        /* If the current listpack is set to NULL, this is the start of the
         * iteration or the previous listpack was completely iterated.
         * Go to the next node. */
        if (si->lp == NULL || si->lp_ele == NULL) {
            // Advance the rax iterator to the next (or previous) node
            if (!si->rev && !raxNext(&si->ri)) return 0;
            else if (si->rev && !raxPrev(&si->ri)) return 0;
            serverAssert(si->ri.key_len == sizeof(streamID));
            /* Get the master ID. */
            streamDecodeID(si->ri.key,&si->master_id);
            /* Get the master fields count. */
            si->lp = si->ri.data;
            si->lp_ele = lpFirst(si->lp);           /* Seek items count */
            si->lp_ele = lpNext(si->lp,si->lp_ele); /* Seek deleted count. */
            si->lp_ele = lpNext(si->lp,si->lp_ele); /* Seek num fields. */
            si->master_fields_count = lpGetInteger(si->lp_ele);
            si->lp_ele = lpNext(si->lp,si->lp_ele); /* Seek first field. */
            si->master_fields_start = si->lp_ele;
            /* We are now pointing to the first field of the master entry.
             * We need to seek either the first or the last entry depending
             * on the direction of the iteration. */
            if (!si->rev) {
                /* If we are iterating in normal order, skip the master fields
                 * to seek the first actual entry. */
                for (uint64_t i = 0; i < si->master_fields_count; i++)
                    si->lp_ele = lpNext(si->lp,si->lp_ele);
            } else {
                /* If we are iterating in reverse direction, just seek the
                 * last part of the last entry in the listpack (that is, the
                 * fields count). */
                si->lp_ele = lpLast(si->lp);
            }
        } else if (si->rev) {
            /* If we are iterating in the reverse order, and this is not
             * the first entry emitted for this listpack, then we already
             * emitted the current entry, and have to go back to the previous
             * one. */
            int lp_count = lpGetInteger(si->lp_ele);
            while(lp_count--) si->lp_ele = lpPrev(si->lp,si->lp_ele);
            /* Seek lp-count of prev entry. */
            si->lp_ele = lpPrev(si->lp,si->lp_ele);
        }

        /* For every radix tree node, iterate the corresponding listpack,
         * returning elements when they are within range. */
        while(1) {
            if (!si->rev) {
                /* If we are going forward, skip the previous entry
                 * lp-count field (or in case of the master entry, the zero
                 * term field) */
                si->lp_ele = lpNext(si->lp,si->lp_ele);
                if (si->lp_ele == NULL) break;
            } else {
                /* If we are going backward, read the number of elements this
                 * entry is composed of, and jump backward N times to seek
                 * its start. */
                int64_t lp_count = lpGetInteger(si->lp_ele);
                if (lp_count == 0) { /* We reached the master entry. */
                    si->lp = NULL;
                    si->lp_ele = NULL;
                    break;
                }
                while(lp_count--) si->lp_ele = lpPrev(si->lp,si->lp_ele);
            }

            /* Get the flags entry. */
            si->lp_flags = si->lp_ele;
            int flags = lpGetInteger(si->lp_ele);
            si->lp_ele = lpNext(si->lp,si->lp_ele); /* Seek ID. */

            /* Get the ID: it is encoded as difference between the master
             * ID and this entry ID. */
            *id = si->master_id;
            id->ms += lpGetInteger(si->lp_ele);
            si->lp_ele = lpNext(si->lp,si->lp_ele);
            id->seq += lpGetInteger(si->lp_ele);
            si->lp_ele = lpNext(si->lp,si->lp_ele);
            unsigned char buf[sizeof(streamID)];
            streamEncodeID(buf,id);

            /* The number of entries is here or not depending on the
             * flags. */
            if (flags & STREAM_ITEM_FLAG_SAMEFIELDS) {
                *numfields = si->master_fields_count;
            } else {
                *numfields = lpGetInteger(si->lp_ele);
                si->lp_ele = lpNext(si->lp,si->lp_ele);
            }

            /* If current >= start, and the entry is not marked as
             * deleted, emit it. */
            if (!si->rev) {
                if (memcmp(buf,si->start_key,sizeof(streamID)) >= 0 &&
                    !(flags & STREAM_ITEM_FLAG_DELETED))
                {
                    if (memcmp(buf,si->end_key,sizeof(streamID)) > 0)
                        return 0; /* We are already out of range. */
                    si->entry_flags = flags;
                    if (flags & STREAM_ITEM_FLAG_SAMEFIELDS)
                        si->master_fields_ptr = si->master_fields_start;
                    return 1; /* Valid item returned. */
                }
            } else {
                if (memcmp(buf,si->end_key,sizeof(streamID)) <= 0 &&
                    !(flags & STREAM_ITEM_FLAG_DELETED))
                {
                    if (memcmp(buf,si->start_key,sizeof(streamID)) < 0)
                        return 0; /* We are already out of range. */
                    si->entry_flags = flags;
                    if (flags & STREAM_ITEM_FLAG_SAMEFIELDS)
                        si->master_fields_ptr = si->master_fields_start;
                    return 1; /* Valid item returned. */
                }
            }

            /* If we do not emit, we have to discard if we are going
             * forward, or seek the previous entry if we are going
             * backward. */
            if (!si->rev) {
                int64_t to_discard = (flags & STREAM_ITEM_FLAG_SAMEFIELDS) ?
                                      *numfields : *numfields*2;
                for (int64_t i = 0; i < to_discard; i++)
                    si->lp_ele = lpNext(si->lp,si->lp_ele);
            } else {
                int64_t prev_times = 4; /* flag + id ms + id seq + one more to
                                           go back to the previous entry "count"
                                           field. */
                /* If the entry was not flagged SAMEFIELD we also read the
                 * number of fields, so go back one more. */
                if (!(flags & STREAM_ITEM_FLAG_SAMEFIELDS)) prev_times++;
                while(prev_times--) si->lp_ele = lpPrev(si->lp,si->lp_ele);
            }
        }

        /* End of listpack reached. Try the next/prev radix tree node. */
    }
}
streamIteratorGetField

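This function reads the next field-value pair of the current entry. When the entry is flagged STREAM_ITEM_FLAG_SAMEFIELDS, the field name is taken from the master fields; otherwise it is read from the entry itself.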

void streamIteratorGetField(streamIterator *si, unsigned char **fieldptr, unsigned char **valueptr, int64_t *fieldlen, int64_t *valuelen) {
    if (si->entry_flags & STREAM_ITEM_FLAG_SAMEFIELDS) {
        *fieldptr = lpGet(si->master_fields_ptr,fieldlen,si->field_buf);
        si->master_fields_ptr = lpNext(si->lp,si->master_fields_ptr);
    } else {
        *fieldptr = lpGet(si->lp_ele,fieldlen,si->field_buf);
        si->lp_ele = lpNext(si->lp,si->lp_ele);
    }
    *valueptr = lpGet(si->lp_ele,valuelen,si->value_buf);
    si->lp_ele = lpNext(si->lp,si->lp_ele);
}