无法重现不友好的 C++ 缓存代码

Question

我正在玩一些代码块，它们有意尝试展示有关 CPU 缓存的想法，以及如何从对齐数据中获益等等。

从这篇文章 http://igoro.com/archive/gallery-of-processor-cache-effects/ 中我发现它非常有趣我正在尝试重现这两个循环的行为（示例 1）：

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

注意第二个循环中的因子 16，因为在我的机器（Intel i5）中 int 的长度为 4 个字节，第二个循环中的每个元素都将位于不同的缓存行中，因此从理论上讲，根据作者的说法，这两个循环将花费几乎相同的时间来执行，因为这里的速度是由内存访问决定的，而不是由整数乘法指令决定的。

所以我尝试重现这种行为，使用两个不同的编译器，MSVC 19.31.31105.0 和 GNU g++ 9.4.0，两种情况下的结果相同，第二个循环大约需要 4%时间相对于第一个循环。这是我用来重现它的实现

#include <cstdlib>
#include <chrono>
#include <vector>
#include <string>

#define SIZE 64 * 1024 * 1024
#define K 16

//Comment this to use C array instead of std vector
//#define USE_C_ARRAY 

int main() {

#if defined(USE_C_ARRAY)
    int* arr = new int[SIZE];
    std::cout << "Testing with C style array\n";
#else 
    std::vector<int> arr(SIZE);
    std::cout << "Using std::vector\n";
#endif

    std::cout << "Step:           " << K << " Elements \n";
    std::cout << "Array Elements: " << SIZE << "\n";
    std::cout << "Array Size:     " << SIZE/(1024*1024) << " MBs \n";
    std::cout << "Int Size:       " << sizeof(int) << "\n";

    auto show_duration = [&](auto title, auto duration){ 
        std::cout << "[" << title << "] Duration: " << 
            std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() <<
            " (milisecs) \n";
    };

    auto start = std::chrono::steady_clock::now();
    // Loop 1
    for (int i = 0; i < SIZE; i++) arr[i]++;
    show_duration("FullArray", 
                (std::chrono::steady_clock::now() - start));


    start = std::chrono::steady_clock::now();
    // Loop 1
    for (int i = 0; i < SIZE; i += K) arr[i]++;

    show_duration(std::string("Array Step ") + std::to_string(K), 
                (std::chrono::steady_clock::now() - start));

#if defined(USE_C_ARRAY)
    delete[] arr;
#endif
    return EXIT_SUCCESS;
}

用g++编译：

 g++ -o for_cache_demo for_cache_demo.cpp -std=c++17 -g3

注 1：我使用 -g3 明确表明我正在避免任何编译器优化，只是为了查看原始行为。

注2：两次循环生成的汇编代码的区别在于循环计数器add函数的第二个参数，即i加1或K

add     DWORD PTR [rbp-4], 16 # When using a 16 Step

Link 在 Compiler Explorer 中测试：https://godbolt.org/z/eqWdce35z

如果您对我可能遗漏的内容有任何建议或想法，那就太好了，非常感谢！

Answer 1

你的错误是没有使用任何优化。 non-optimized 代码的开销掩盖了任何缓存效果。

当我不使用任何优化时，我得到以下信息：

$ g++ -O0 cache.cpp -o cache
$ ./cache 
Using std::vector
Step:           16 Elements 
Array Elements: 67108864
Array Size:     64 MBs 
Int Size:       4
[FullArray] Duration: 105 (milisecs) 
[Array Step 16] Duration: 23 (milisecs)

如果我使用优化，即使是最保守的优化标志，-O1，结果突然改变：

$ g++ -O1 cache.cpp -o cache
$ ./cache 
Using std::vector
Step:           16 Elements 
Array Elements: 67108864
Array Size:     64 MBs 
Int Size:       4
[FullArray] Duration: 26 (milisecs) 
[Array Step 16] Duration: 22 (milisecs)

(可以看到step 16的运行时间变化不大，因为已经被内存访问瓶颈了，而step 1访问的运行时间大幅提升，因为是受执行指令限制，不受内存带宽限制）

进一步提高优化级别并没有改变我的结果。

您可以比较不同优化级别的内部循环。随着 -O0:

vector_step_1(std::vector<int, std::allocator<int> >&):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     QWORD PTR [rbp-24], rdi
        mov     DWORD PTR [rbp-4], 0
.L7:
        cmp     DWORD PTR [rbp-4], 67108863
        jg      .L8
        mov     eax, DWORD PTR [rbp-4]
        movsx   rdx, eax
        mov     rax, QWORD PTR [rbp-24]
        mov     rsi, rdx
        mov     rdi, rax
        call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
        mov     rdx, rax
        mov     ecx, DWORD PTR [rdx]
        mov     eax, ecx
        add     eax, eax
        add     eax, ecx
        mov     DWORD PTR [rdx], eax
        add     DWORD PTR [rbp-4], 1
        jmp     .L7
.L8:
        nop
        leave
        ret

您可以看到它在每次迭代时都会调用整个函数。

与-O1:

vector_step_1(std::vector<int, std::allocator<int> >&):
        mov     eax, 0
.L5:
        mov     rdx, rax
        add     rdx, QWORD PTR [rdi]
        mov     ecx, DWORD PTR [rdx]
        lea     ecx, [rcx+rcx*2]
        mov     DWORD PTR [rdx], ecx
        add     rax, 4
        cmp     rax, 268435456
        jne     .L5
        ret

现在，循环中只有 8 条指令。在 -O2 时，每个循环变成 6 条指令。

无法重现不友好的 C++ 缓存代码

Couldn't Reproduce non-friendly C++ cache code

c++

cpu-cache