The Impact of CPU Cache on Concurrent Programming


拾牙慧者, 2022-06-27 23:24:44


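© Copyright belongs to the author. This is an original work by 51CTO blog author 拾牙慧者. Contact the author for permission before reposting; unauthorized reposting may lead to legal action.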

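Contents

Introduction
The Impact of CPU Cache on Concurrency
The Impact of Access Order on Performance
The Impact of Byte Alignment on the Cache
Summary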

Introduction

The two programs below are almost identical, yet their running times differ greatly.
Program 1

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include <chrono>

const uint32_t MAX_THREADS = 16;

void* ThreadFunc(void* pArg)
{
    // volatile keeps every increment as a real load and store, so an
    // optimizing compiler cannot fold the loop into a single addition
    volatile uint64_t* pCounter = (volatile uint64_t*)pArg;
    for (int i = 0; i < 1000000000; ++i) // one billion increments
    {
        *pCounter = *pCounter + 1;
    }
    return NULL;
}

int main() {

    static uint64_t aulArr[MAX_THREADS * 8];
    pthread_t aulThreadID[MAX_THREADS];
    auto begin = std::chrono::steady_clock::now();
    for (uint32_t i = 0; i < MAX_THREADS; ++i)
    {
        // Call outside assert(): with NDEBUG the assert expression is not evaluated
        int rc = pthread_create(&aulThreadID[i], nullptr, ThreadFunc, &aulArr[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (uint32_t i = 0; i < MAX_THREADS; ++i)
    {
        int rc = pthread_join(aulThreadID[i], nullptr);
        assert(rc == 0);
        (void)rc;
    }
    auto end = std::chrono::steady_clock::now();
    // Elapsed wall-clock time in milliseconds
    printf("%lld\n", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count());
}

Elapsed time: 26396 ms

Program 2

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include <chrono>

const uint32_t MAX_THREADS = 16;

void* ThreadFunc(void* pArg)
{
    // volatile keeps every increment as a real load and store, so an
    // optimizing compiler cannot fold the loop into a single addition
    volatile uint64_t* pCounter = (volatile uint64_t*)pArg;
    for (int i = 0; i < 1000000000; ++i) // one billion increments
    {
        *pCounter = *pCounter + 1;
    }
    return NULL;
}

int main() {

    static uint64_t aulArr[MAX_THREADS * 8];
    pthread_t aulThreadID[MAX_THREADS];
    auto begin = std::chrono::steady_clock::now();
    for (uint32_t i = 0; i < MAX_THREADS; ++i)
    {
        // Counters are 8 elements (64 bytes) apart; call outside assert() so it survives NDEBUG
        int rc = pthread_create(&aulThreadID[i], nullptr, ThreadFunc, &aulArr[i * 8]);
        assert(rc == 0);
        (void)rc;
    }
    for (uint32_t i = 0; i < MAX_THREADS; ++i)
    {
        int rc = pthread_join(aulThreadID[i], nullptr);
        assert(rc == 0);
        (void)rc;
    }
    auto end = std::chrono::steady_clock::now();
    // Elapsed wall-clock time in milliseconds
    printf("%lld\n", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count());
}

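Elapsed time: 6762 ms

The only real difference between the two is the argument passed to pthread_create: &aulArr[i] in the first program and &aulArr[i * 8] in the second.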

The Impact of CPU Cache on Concurrency

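When CPU caches synchronize data, they do so in a minimum unit called the cache line, which is 64 bytes on current mainstream CPUs.
When several cores write to the same cache line, their copies must be kept coherent: each write invalidates the line in the other cores' caches, so the line is dirtied again and again and coherence traffic becomes frequent. When the cores access different cache lines, no such synchronization is needed.
In the programs above:
static uint64_t aulArr[MAX_THREADS * 8]; occupies 8 bytes * 8 * 16 = 1024 bytes,
and 8 bytes * 8 = 64 bytes is exactly one cache line.
Program 1: the 16 counters are adjacent and cover only 128 bytes, so they are packed into two or three cache lines (depending on how the array happens to be aligned), and each line is written by several threads at once. This is known as false sharing.
Program 2: the counters are 64 bytes apart, so each thread writes to its own cache line. This avoids repeatedly invalidating a shared cache line, and the running time drops sharply.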

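The Impact of Access Order on Performance

The CPU prefetches: when it loads the block we need, it also brings neighboring blocks into the cache. So when we read sequentially, most reads are served from the cache instead of from main memory.
Sequential read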

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <string.h>
#include <chrono>

int main() {
    const uint32_t BLOCK_SIZE = 8 << 20;  // number of 64-byte blocks (512 MiB in total)

    // Align to 64 bytes so that each block is exactly one cache line
    static char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));
    assert((uintptr_t)memory % 64 == 0);
    memset(memory, 0x3c, sizeof(memory));

    volatile char sink = 0;  // keeps the result live so the loops are not optimized away
    int n = 10;
    auto begin = std::chrono::steady_clock::now();
    while (n--)
    {
        char result = 0;
        for (uint32_t i = 0; i < BLOCK_SIZE; ++i)
        {
            for (int j = 0; j < 64; ++j)
            {
                result ^= memory[i][j];
            }
        }
        sink = result;
    }
    auto end = std::chrono::steady_clock::now();
    // Elapsed wall-clock time in milliseconds
    printf("%lld\n", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count());
}

Shuffled read

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <string.h>
#include <chrono>

int main() {
    const uint32_t BLOCK_SIZE = 8 << 20;  // number of 64-byte blocks (512 MiB in total)

    // Align to 64 bytes so that each block is exactly one cache line
    static char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));
    assert((uintptr_t)memory % 64 == 0);
    memset(memory, 0x3c, sizeof(memory));

    volatile char sink = 0;  // keeps the result live so the loops are not optimized away
    int n = 10;
    auto begin = std::chrono::steady_clock::now();
    while (n--)
    {
        char result = 0;
        for (uint32_t i = 0; i < BLOCK_SIZE; ++i)
        {
            // Deliberately shuffle the order; unsigned i avoids signed overflow in i * 5183
            uint32_t k = i * 5183 % BLOCK_SIZE;
            for (int j = 0; j < 64; ++j)
            {
                result ^= memory[k][j];
            }
        }
        sink = result;
    }
    auto end = std::chrono::steady_clock::now();
    // Elapsed wall-clock time in milliseconds
    printf("%lld\n", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count());
}

The sequential read takes 13547 ms; the shuffled read takes 21395 ms.
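So how can we optimize if we really have to read in a random order?
If we know which address the next iteration will read, and we do not need to access it immediately, we can use the _mm_prefetch intrinsic to tell the CPU to fetch the next iteration's cache line ahead of time.
For details on this intrinsic, see the official documentation: https://docs.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/84szxsww(v=vs.100)
With the prefetch added, let's look at the running time again: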

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <string.h>
#include <chrono>
#include <xmmintrin.h>

int main() {
    const uint32_t BLOCK_SIZE = 8 << 20;  // number of 64-byte blocks (512 MiB in total)

    // Align to 64 bytes so that each block is exactly one cache line
    static char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));
    assert((uintptr_t)memory % 64 == 0);
    memset(memory, 0x3c, sizeof(memory));

    volatile char sink = 0;  // keeps the result live so the loops are not optimized away
    int n = 10;
    auto begin = std::chrono::steady_clock::now();
    while (n--)
    {
        char result = 0;
        for (uint32_t i = 0; i < BLOCK_SIZE; ++i)
        {
            // Hint the cache line that the next iteration will read
            uint32_t next_k = (i + 1) * 5183 % BLOCK_SIZE;
            _mm_prefetch(&memory[next_k][0], _MM_HINT_T0);
            // Deliberately shuffle the order; unsigned i avoids signed overflow in i * 5183
            uint32_t k = i * 5183 % BLOCK_SIZE;
            for (int j = 0; j < 64; ++j)
            {
                result ^= memory[k][j];
            }
        }
        sink = result;
    }
    auto end = std::chrono::steady_clock::now();
    // Elapsed wall-clock time in milliseconds
    printf("%lld\n", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count());
}

The running time drops from the original 21395 ms to 15291 ms.

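The Impact of Byte Alignment on the Cache

The test XORs int64 elements 2.6 billion times over 2 GB of memory, measuring the elapsed time with aligned and unaligned addresses under both sequential and random access.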

	Unaligned	Aligned	Time ratio (unaligned : aligned)
Sequential access	7.8 s	7.7 s	1.01:1
Random access	90 s	80 s	1.125:1

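With sequential access, the cache hit rate is high and the CPU prefetches, so alignment makes little difference.
With random access, the cache hit rate is close to zero, and an unaligned int64 has a 1/8 chance of straddling two cache lines, in which case memory has to be read twice. The expected time ratio is therefore roughly 7/8 * 1 + 1/8 * 2 = 1.125.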
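The conclusions are:
1. Within a single cache line, accessing an unaligned variable costs about the same as accessing an aligned one.
2. The cost of an access that crosses a cache line is mainly the extra memory read.
So, apart from network protocol structures, avoid 1-byte alignment (packing). To reduce memory overhead, reorder the struct members instead.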