Active memory defragmentation

Redis 4 introduced active defragmentation, a feature that reclaims fragmented memory while the server keeps running: it lets Redis compact small allocations and the unused free space trapped between them, allowing the memory to be recovered.

Fragmentation is a problem every memory allocator eventually runs into, and it ties up extra resources. The usual remedies are to restart the service, which resets the fragmentation ratio, or to migrate all the data away and migrate it back once the old instance has been emptied. Because both options are heavyweight, Redis provides a way to defragment memory while the service is running.
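
Fragmentation can be observed at runtime through INFO memory; mem_fragmentation_ratio is roughly used_memory_rss divided by used_memory. A hypothetical reading (field names are real, values purely illustrative):

127.0.0.1:6379> INFO memory
# Memory
used_memory:1073741824
used_memory_rss:1610612736
mem_fragmentation_ratio:1.50

A ratio well above 1 means the process holds noticeably more physical memory than the data actually needs.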

The basic flow: once the fragmentation ratio exceeds a configured threshold, Redis uses a jemalloc-specific facility to request a fresh, contiguous allocation, copies the data into it, and then frees the old address, completing the defragmentation.
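
The behavior is driven by a handful of redis.conf options. A sketch of the relevant block (the option names are real; the values here are illustrative and the defaults differ between versions):

activedefrag yes
# don't bother when the absolute fragmented bytes are below this
active-defrag-ignore-bytes 100mb
# start defragging above this fragmentation percentage
active-defrag-threshold-lower 10
# apply maximum effort above this percentage
active-defrag-threshold-upper 100
# minimal/maximal CPU percentage to spend on defragmentation
active-defrag-cycle-min 5
active-defrag-cycle-max 75
# values with more fields than this are deferred to a later pass
active-defrag-max-scan-fields 1000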

Defragmentation flow

This article is based on redis-6.0.10. During startup, Redis registers a timer event with its event loop:

    /* Create the timer callback, this is our way to process many background
     * operations incrementally, like clients timeout, eviction of unaccessed
     * expired keys and so forth. */
    if (aeCreateTimeEvent(server.el, 1, serverCron, NULL, NULL) == AE_ERR) {
        serverPanic("Can't create event loop timers.");
        exit(1);
    }

serverCron fires roughly server.hz times per second (10 by default). Inside it is a call to databasesCron, and it is databasesCron that invokes activeDefragCycle, the active defragmentation function.

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
		...
    /* We need to do a few operations on clients asynchronously. */
    clientsCron();

    /* Handle background operations on Redis databases. */
    databasesCron();
		...
}	


void databasesCron(void) {
		...
		
    /* Defrag keys gradually. */
    activeDefragCycle();
    ...

}

The core defragmentation logic all lives in activeDefragCycle.

The activeDefragCycle function
/* Perform incremental defragmentation work from the serverCron.
 * This works in a similar way to activeExpireCycle, in the sense that
 * we do incremental work across calls. */
void activeDefragCycle(void) {
    static int current_db = -1;
    static unsigned long cursor = 0;
    static redisDb *db = NULL;
    static long long start_scan, start_stat;
    unsigned int iterations = 0;
    unsigned long long prev_defragged = server.stat_active_defrag_hits;
    unsigned long long prev_scanned = server.stat_active_defrag_scanned;
    long long start, timelimit, endtime;
    mstime_t latency;
    int quit = 0;

    if (!server.active_defrag_enabled) {   // is active defrag enabled in the config?
        if (server.active_defrag_running) {   // is a run currently in progress?
            /* if active defrag was disabled mid-run, start from fresh next time. */
            server.active_defrag_running = 0;  // reset the in-progress run
            if (db)
                listEmpty(db->defrag_later);
            defrag_later_current_key = NULL;
            defrag_later_cursor = 0;
            current_db = -1;
            cursor = 0;
            db = NULL;
        }
        return;   // defrag is disabled, nothing to do
    }

    if (hasActiveChildProcess())   // skip while a child process (fork) is running
        return; /* Defragging memory while there's a fork will just do damage. */

    /* Once a second, check if the fragmentation justifies starting a scan
     * or making it more aggressive. */
    run_with_period(1000) {
        computeDefragCycles();  // decide whether a scan is needed and how aggressive it should be
    }
    if (!server.active_defrag_running)  // no defrag work is needed right now
        return;

    /* See activeExpireCycle for how timelimit is handled. */
    start = ustime();  // record the start time and compute this cycle's time budget
    timelimit = 1000000*server.active_defrag_running/server.hz/100;
    if (timelimit <= 0) timelimit = 1;
    endtime = start + timelimit;
    latencyStartMonitor(latency);

    do {
        /* if we're not continuing a scan from the last call or loop, start a new one */
        if (!cursor) {   // cursor == 0 means we are starting fresh, not resuming a scan
            /* finish any leftovers from previous db before moving to the next one */
            if (db && defragLaterStep(db, endtime)) {
                quit = 1; /* time is up, we didn't finish all the work */
                break; /* this will exit the function and we'll continue on the next cycle */
            }

            /* Move on to next database, and stop if we reached the last one. */
            if (++current_db >= server.dbnum) {  // every database has been scanned
                /* defrag other items not part of the db / keys */
                defragOtherGlobals();

                long long now = ustime();
                size_t frag_bytes;
                float frag_pct = getAllocatorFragmentation(&frag_bytes);
                serverLog(LL_VERBOSE,
                    "Active defrag done in %dms, reallocated=%d, frag=%.0f%%, frag_bytes=%zu",
                    (int)((now - start_scan)/1000), (int)(server.stat_active_defrag_hits - start_stat), frag_pct, frag_bytes);

                start_scan = now;
                current_db = -1;
                cursor = 0;
                db = NULL;
                server.active_defrag_running = 0;

                computeDefragCycles(); /* if another scan is needed, start it right away */
                if (server.active_defrag_running != 0 && ustime() < endtime)
                    continue;
                break;
            }
            else if (current_db==0) {
                /* Start a scan from the first database. */
                start_scan = ustime();
                start_stat = server.stat_active_defrag_hits;
            }

            db = &server.db[current_db];
            cursor = 0;
        }

        do {
            /* before scanning the next bucket, see if we have big keys left from the previous bucket to scan */
            if (defragLaterStep(db, endtime)) {  // process deferred big keys; nonzero means time ran out
                quit = 1; /* time is up, we didn't finish all the work */
                break; /* this will exit the function and we'll continue on the next cycle */
            }

            cursor = dictScan(db->dict, cursor, defragScanCallback, defragDictBucketCallback, db);  // scan the next bucket of this db

            /* Once in 16 scan iterations, 512 pointer reallocations, or 64 keys
             * (if we have a lot of pointers in one hash bucket or rehashing),
             * check if we reached the time limit.
             * But regardless, don't start a new db in this loop, this is because after
             * the last db we call defragOtherGlobals, which must be done in one cycle */
            if (!cursor || (++iterations > 16 ||
                            server.stat_active_defrag_hits - prev_defragged > 512 ||
                            server.stat_active_defrag_scanned - prev_scanned > 64)) {
                if (!cursor || ustime() > endtime) {  // scan finished, or the time budget is exhausted
                    quit = 1;
                    break;
                }
                iterations = 0;
                prev_defragged = server.stat_active_defrag_hits;  
                prev_scanned = server.stat_active_defrag_scanned;
            }
        } while(cursor && !quit);
    } while(!quit);

    latencyEndMonitor(latency);
    latencyAddSampleIfNeeded("active-defrag-cycle",latency);  // record a latency sample for this cycle
}
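
Two details are worth unpacking. First, computeDefragCycles (not shown above) measures the allocator fragmentation and maps it to a CPU-effort percentage stored in server.active_defrag_running. Below is a minimal sketch of that mapping step only, under the config names shown earlier, not the verbatim source; the real function also honors active-defrag-ignore-bytes and never lowers the effort mid-scan:

/* Sketch: linearly interpolate the fragmentation percentage between the
 * two thresholds to pick a CPU effort between active-defrag-cycle-min
 * and active-defrag-cycle-max. */
static int defrag_cpu_pct(float frag_pct, float lower, float upper,
                          int cycle_min, int cycle_max) {
    if (frag_pct <= lower) return cycle_min;
    if (frag_pct >= upper) return cycle_max;
    return cycle_min + (int)((frag_pct - lower) *
                             (cycle_max - cycle_min) / (upper - lower));
}

Second, that percentage feeds the time budget: timelimit = 1000000*active_defrag_running/server.hz/100 microseconds. For example, with server.hz = 10 and active_defrag_running = 25 (25% CPU), each serverCron tick may spend 1000000*25/10/100 = 25000 us, i.e. 25 ms of defrag work, ten times per second.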

To summarize, the rough flow is:

  1. Check whether defragmentation is currently needed.
  2. If a previous run was interrupted, resume from where it left off; otherwise start a fresh traversal from the beginning.
  3. Walk each database in turn, defragmenting as it goes; stop when the time budget is used up.
  4. Once every database has been traversed, record the monitoring information.

That is the overall flow; the heavy lifting sits in the two callbacks passed to dictScan, defragScanCallback and defragDictBucketCallback. A simplified view of dictScan:

unsigned long dictScan(dict *d,
                       unsigned long v,
                       dictScanFunction *fn,
                       dictScanBucketFunction* bucketfn,
                       void *privdata)
{
    dictht *t0, *t1;
    const dictEntry *de, *next;
    unsigned long m0, m1;

    if (dictSize(d) == 0) return 0;

    /* Having a safe iterator means no rehashing can happen, see _dictRehashStep.
     * This is needed in case the scan callback tries to do dictFind or alike. */
    d->iterators++;

    if (!dictIsRehashing(d)) {   // not rehashing: only ht[0] is in use
        t0 = &(d->ht[0]);
        m0 = t0->sizemask;

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]);  // let the bucket callback handle the bucket first
        de = t0->table[v & m0];   // head of the entry chain in this bucket
        while (de) {
            next = de->next;
            fn(privdata, de);     // invoke the per-entry callback on each entry
            de = next;
        }

        /* Set unmasked bits so incrementing the reversed cursor
         * operates on the masked bits */
        v |= ~m0;

        /* Increment the reverse cursor */
        v = rev(v);
        v++;
        v = rev(v);

    } else {
        t0 = &d->ht[0];
        t1 = &d->ht[1];

        /* Make sure t0 is the smaller and t1 is the bigger table */
        if (t0->size > t1->size) {
            t0 = &d->ht[1];
            t1 = &d->ht[0];
        }

        m0 = t0->sizemask;
        m1 = t1->sizemask;

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]);
        de = t0->table[v & m0];
        while (de) {
            next = de->next;
            fn(privdata, de);
            de = next;
        }

        /* Iterate over indices in larger table that are the expansion
         * of the index pointed to by the cursor in the smaller table */
        do {
            /* Emit entries at cursor */
            if (bucketfn) bucketfn(privdata, &t1->table[v & m1]);
            de = t1->table[v & m1];
            while (de) {
                next = de->next;
                fn(privdata, de);
                de = next;
            }

            /* Increment the reverse cursor not covered by the smaller mask.*/
            v |= ~m1;
            v = rev(v);
            v++;
            v = rev(v);

            /* Continue while bits covered by mask difference is non-zero */
        } while (v & (m0 ^ m1));
    }

    /* undo the ++ at the top */
    d->iterators--;

    return v;
}

The iteration algorithm behind this cursor is interesting in its own right; interested readers can dig into it further.
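
For reference, the cursor is a reverse-binary counter: dictScan effectively increments the cursor starting from the high bit, which keeps the scan complete across table resizes. The bit-reversal helper it relies on is rev() in dict.c:

/* Reverse the bits of an unsigned long by swapping halves, then quarters,
 * and so on (CHAR_BIT comes from <limits.h>). */
static unsigned long rev(unsigned long v) {
    unsigned long s = CHAR_BIT * sizeof(v); /* bit size; must be power of 2 */
    unsigned long mask = ~0UL;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

So the v |= ~m0; v = rev(v); v++; v = rev(v); sequence seen above literally means "add 1 to the cursor, carrying from the high bit toward the low bit".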

The defragScanCallback function
/* Defrag scan callback for the main db dictionary. */
void defragScanCallback(void *privdata, const dictEntry *de) {
    long defragged = defragKey((redisDb*)privdata, (dictEntry*)de); // reallocate the key and its value according to the value type
    server.stat_active_defrag_hits += defragged;
    if(defragged)
        server.stat_active_defrag_key_hits++;
    else
        server.stat_active_defrag_key_misses++;
    server.stat_active_defrag_scanned++;
}

It mainly calls defragKey, which walks the key's data according to its type:

/* for each key we scan in the main dict, this function will attempt to defrag
 * all the various pointers it has. Returns a stat of how many pointers were
 * moved. */
long defragKey(redisDb *db, dictEntry *de) {
    sds keysds = dictGetKey(de);   // the key name (an sds string)
    robj *newob, *ob;
    unsigned char *newzl;
    long defragged = 0;
    sds newsds;

    /* Try to defrag the key name. */
    newsds = activeDefragSds(keysds);   // try to move the key name to a new allocation
    if (newsds)
        defragged++, de->key = newsds;
    if (dictSize(db->expires)) {        // db->expires may hold the same key pointer; keep it in sync
         /* Dirty code:
          * I can't search in db->expires for that key after i already released
          * the pointer it holds it won't be able to do the string compare */
        uint64_t hash = dictGetHash(db->dict, de->key);
        replaceSatelliteDictKeyPtrAndOrDefragDictEntry(db->expires, keysds, newsds, hash, &defragged);
    }

    /* Try to defrag robj and / or string value. */
    ob = dictGetVal(de);    // the value object
    if ((newob = activeDefragStringOb(ob, &defragged))) {
        de->v.val = newob;
        ob = newob;
    }

    if (ob->type == OBJ_STRING) {   // dispatch on the value type
        /* Already handled in activeDefragStringOb. */
    } else if (ob->type == OBJ_LIST) {   // list value
        if (ob->encoding == OBJ_ENCODING_QUICKLIST) {
            defragged += defragQuicklist(db, de);   // defrag the quicklist nodes
        } else if (ob->encoding == OBJ_ENCODING_ZIPLIST) {
            if ((newzl = activeDefragAlloc(ob->ptr)))
                defragged++, ob->ptr = newzl;
        } else {
            serverPanic("Unknown list encoding");
        }
    } else if (ob->type == OBJ_SET) {   // set value: defrag its hash table
        if (ob->encoding == OBJ_ENCODING_HT) {
            defragged += defragSet(db, de);
        } else if (ob->encoding == OBJ_ENCODING_INTSET) {
            intset *newis, *is = ob->ptr;
            if ((newis = activeDefragAlloc(is)))
                defragged++, ob->ptr = newis;
        } else {
            serverPanic("Unknown set encoding");
        }
    } else if (ob->type == OBJ_ZSET) {   
        if (ob->encoding == OBJ_ENCODING_ZIPLIST) {
            if ((newzl = activeDefragAlloc(ob->ptr)))
                defragged++, ob->ptr = newzl;
        } else if (ob->encoding == OBJ_ENCODING_SKIPLIST) {
            defragged += defragZsetSkiplist(db, de);
        } else {
            serverPanic("Unknown sorted set encoding");
        }
    } else if (ob->type == OBJ_HASH) {   // hash value
        if (ob->encoding == OBJ_ENCODING_ZIPLIST) {
            if ((newzl = activeDefragAlloc(ob->ptr)))
                defragged++, ob->ptr = newzl;
        } else if (ob->encoding == OBJ_ENCODING_HT) {
            defragged += defragHash(db, de);
        } else {
            serverPanic("Unknown hash encoding");
        }
    } else if (ob->type == OBJ_STREAM) {
        defragged += defragStream(db, de);
    } else if (ob->type == OBJ_MODULE) {
        /* Currently defragmenting modules private data types
         * is not supported. */
    } else {
        serverPanic("Unknown object type");
    }
    return defragged;
}
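
One helper used above deserves a look: activeDefragSds, which moves the key name. An sds pointer points into the middle of its allocation (just past the header), so after the block is moved the pointer must be recomputed at the same offset. Its implementation in defrag.c is essentially:

sds activeDefragSds(sds sdsptr) {
    void* ptr = sdsAllocPtr(sdsptr);       /* start of the raw allocation */
    void* newptr = activeDefragAlloc(ptr); /* try to move the whole block */
    if (newptr) {
        size_t offset = sdsptr - (char*)ptr;
        sdsptr = (char*)newptr + offset;   /* same offset into the new block */
        return sdsptr;
    }
    return NULL;
}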

Taking hash as a simple example, let's follow the memory handling further:

long defragHash(redisDb *db, dictEntry *kde) {
    long defragged = 0;
    robj *ob = dictGetVal(kde);   // the hash's value object
    dict *d, *newd;
    serverAssert(ob->type == OBJ_HASH && ob->encoding == OBJ_ENCODING_HT);
    d = ob->ptr;
    if (dictSize(d) > server.active_defrag_max_scan_fields)  // too many fields to scan in one pass?
        defragLater(db, kde);   // queue the key on defrag_later for incremental processing
    else
        defragged += activeDefragSdsDict(d, DEFRAG_SDS_DICT_VAL_IS_SDS);
    /* handle the dict struct */
    if ((newd = activeDefragAlloc(ob->ptr)))   // try to move the dict struct itself
        defragged++, ob->ptr = newd;
    /* defrag the dict tables */
    defragged += dictDefragTables(ob->ptr);   // move the dict's bucket arrays
    return defragged;
}


/* Defrag helper for dict main allocations (dict struct, and hash tables).
 * receives a pointer to the dict* and implicitly updates it when the dict
 * struct itself was moved. Returns a stat of how many pointers were moved. */
long dictDefragTables(dict* d) {
    dictEntry **newtable;
    long defragged = 0;
    /* handle the first hash table */
    newtable = activeDefragAlloc(d->ht[0].table);   // try to move the first bucket array
    if (newtable)
        defragged++, d->ht[0].table = newtable;     // point at the new address
    /* handle the second hash table */
    if (d->ht[1].table) {
        newtable = activeDefragAlloc(d->ht[1].table);  // ht[1] exists during rehashing; move it too
        if (newtable)
            defragged++, d->ht[1].table = newtable;
    }
    return defragged;
}

The function that actually moves an allocation:

/* Defrag helper for generic allocations.
 *
 * returns NULL in case the allocation wasn't moved.
 * when it returns a non-null value, the old pointer was already released
 * and should NOT be accessed. */
void* activeDefragAlloc(void *ptr) {
    size_t size;
    void *newptr;
    if(!je_get_defrag_hint(ptr)) {
        server.stat_active_defrag_misses++;
        size = zmalloc_size(ptr);
        return NULL;
    }
    /* move this allocation to a new allocation.
     * make sure not to use the thread cache. so that we don't get back the same
     * pointers we try to free */
    size = zmalloc_size(ptr);   // size of the old allocation
    newptr = zmalloc_no_tcache(size);   // allocate a new block, bypassing the thread cache
    memcpy(newptr, ptr, size);    // copy the payload into the fresh allocation
    zfree_no_tcache(ptr);         // free the old block, also bypassing the thread cache
    return newptr;
}

That completes one allocate-copy-free round trip. The crucial point is the jemalloc-specific trick: by bypassing the thread cache, Redis is guaranteed a genuinely new allocation rather than the very block it is about to free, so the old, fragmented block really is released.
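
For reference, the no-tcache allocation pair lives in zmalloc.c and is only compiled when Redis is built against the bundled jemalloc; stripped of assertions it is essentially:

void *zmalloc_no_tcache(size_t size) {
    /* MALLOCX_TCACHE_NONE sends the request straight to the arena,
     * bypassing the per-thread cache. */
    void *ptr = mallocx(size+PREFIX_SIZE, MALLOCX_TCACHE_NONE);
    if (!ptr) zmalloc_oom_handler(size);
    update_zmalloc_stat_alloc(zmalloc_size(ptr));
    return ptr;
}

void zfree_no_tcache(void *ptr) {
    if (ptr == NULL) return;
    update_zmalloc_stat_free(zmalloc_size(ptr));
    dallocx(ptr, MALLOCX_TCACHE_NONE);
}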

The defragDictBucketCallback function
/* Defrag scan callback for each hash table bucket,
 * used in order to defrag the dictEntry allocations. */
void defragDictBucketCallback(void *privdata, dictEntry **bucketref) {
    UNUSED(privdata); /* NOTE: this function is also used by both activeDefragCycle and scanLaterHash, etc. don't use privdata */
    while(*bucketref) {
        dictEntry *de = *bucketref, *newde;
        if ((newde = activeDefragAlloc(de))) {  // try to move each dictEntry in the chain
            *bucketref = newde;
        }
        bucketref = &(*bucketref)->next;
    }
}

Its job is to relocate the dictEntry allocations themselves, patching the bucket's chain pointers as it walks.

When the time budget expires, the remaining work waits for the next cycle
/* returns 0 if no more work needs to be done, and 1 if time is up and more work is needed. */
int defragLaterStep(redisDb *db, long long endtime) {
    unsigned int iterations = 0;
    unsigned long long prev_defragged = server.stat_active_defrag_hits;
    unsigned long long prev_scanned = server.stat_active_defrag_scanned;
    long long key_defragged;

    do {
        /* if we're not continuing a scan from the last call or loop, start a new one */
        if (!defrag_later_cursor) {   // cursor == 0: done with the previous key, move to the next one
            listNode *head = listFirst(db->defrag_later);    // head of this db's defrag_later list

            /* Move on to next key */
            if (defrag_later_current_key) { // the previous key is finished; remove it from the list
                serverAssert(defrag_later_current_key == head->value);
                listDelNode(db->defrag_later, head);
                defrag_later_cursor = 0;
                defrag_later_current_key = NULL;
            }

            /* stop if we reached the last one. */
            head = listFirst(db->defrag_later);  // list exhausted: all deferred keys are done
            if (!head)
                return 0;

            /* start a new key */
            defrag_later_current_key = head->value;  // start on a new key
            defrag_later_cursor = 0;
        }

        /* each time we enter this function we need to fetch the key from the dict again (if it still exists) */
        dictEntry *de = dictFind(db->dict, defrag_later_current_key);  // re-fetch the entry; the key may have been deleted meanwhile
        key_defragged = server.stat_active_defrag_hits;
        do {
            int quit = 0;
            if (defragLaterItem(de, &defrag_later_cursor, endtime))  // defrag one more chunk of this entry
                quit = 1; /* time is up, we didn't finish all the work */

            /* Once in 16 scan iterations, 512 pointer reallocations, or 64 fields
             * (if we have a lot of pointers in one hash bucket, or rehashing),
             * check if we reached the time limit. */
            if (quit || (++iterations > 16 ||
                            server.stat_active_defrag_hits - prev_defragged > 512 ||
                            server.stat_active_defrag_scanned - prev_scanned > 64)) {
                if (quit || ustime() > endtime) {
                    if(key_defragged != server.stat_active_defrag_hits)
                        server.stat_active_defrag_key_hits++;
                    else
                        server.stat_active_defrag_key_misses++;
                    return 1;
                }
                iterations = 0;
                prev_defragged = server.stat_active_defrag_hits;
                prev_scanned = server.stat_active_defrag_scanned;
            }
        } while(defrag_later_cursor);   // keep going until this key's cursor wraps back to 0
        if(key_defragged != server.stat_active_defrag_hits)
            server.stat_active_defrag_key_hits++;
        else
            server.stat_active_defrag_key_misses++;
    } while(1);
}

During the scan, a value with more fields than the configured limit is not handled inline; defragLater records its key so it can be traversed incrementally later:

/* when the value has lots of elements, we want to handle it later and not as
 * part of the main dictionary scan. this is needed in order to prevent latency
 * spikes when handling large items */
void defragLater(redisDb *db, dictEntry *kde) {
    sds key = sdsdup(dictGetKey(kde));
    listAddNodeTail(db->defrag_later, key);  // append the key to the db's defrag_later list
}

That rounds out the whole defragmentation process. The core logic is to use the jemalloc facility to obtain a new allocation, move the data onto it, and free the old memory, thereby accomplishing the defragmentation.

Summary

Active defragmentation is, in the end, an auxiliary feature: it lets Redis reorganize memory while continuing to serve requests, though it is only available when Redis is built with the bundled jemalloc. Roughly, each timer event carves out a small slice of time to allocate new space, copy the data over, and release the old space, shuffling memory around to raise utilization. The extra work does cost the server some performance, so for latency-sensitive deployments it is best to keep the feature off, enable it manually during low-traffic periods, and disable it again once the cleanup is done.
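
For example, the feature can be toggled at runtime without a restart, and its progress observed via the active_defrag_* counters in INFO stats, which correspond to the server.stat_active_defrag_* fields updated in the code above (output abbreviated):

127.0.0.1:6379> CONFIG SET activedefrag yes
OK
127.0.0.1:6379> INFO stats
...
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
...
127.0.0.1:6379> CONFIG SET activedefrag no
OK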