使用嵌套向量与展平向量包装器，奇怪的行为

Question

问题

长期以来，我的印象是使用嵌套的std::vector<std::vector...>来模拟N维数组通常是不好的，因为内存不能保证是连续的，并且可能会出现缓存未命中.我认为最好使用平面矢量并将多维映射到一维，反之亦然。因此，我决定对其进行测试（最后列出了代码）。这非常简单，我将 reading/writing 计时到嵌套的 3D 向量与我自己的 1D 向量的 3D 包装器。我使用 g++ 和 clang++ 编译了代码，并启用了 -O3 优化。对于每个运行我都改变了尺寸，所以我可以很好地了解行为。令我惊讶的是，这些是我在我的机器 MacBook Pro（Retina，13 英寸，2012 年末）、2.5GHz i5、8GB RAM，OS X 10.10.5:

上获得的结果

g++ 5.2

dimensions       nested   flat
X   Y   Z        (ms)     (ms) 

100 100 100  ->  16       24
150 150 150  ->  58       98
200 200 200  ->  136     308
250 250 250  ->  264     746
300 300 300  ->  440    1537

clang++ (LLVM 7.0.0)

dimensions       nested   flat
X   Y   Z        (ms)     (ms) 

100 100 100  ->  16       18
150 150 150  ->  53       61
200 200 200  ->  135     137
250 250 250  ->  255     271
300 300 300  ->  423     477

如您所见，"flatten" 包装器永远不会打败嵌套版本。此外，与 libc++ 实现相比，g++ 的 libstdc++ 实现执行得相当糟糕，例如对于 300 x 300 x 300，展平版本几乎比嵌套版本慢 4 倍。 libc++ 似乎具有相同的性能。

我的问题：

为什么扁平化版本没有更快？不应该吗？我在测试代码中遗漏了什么吗？
此外，为什么g++的libstdc++在使用flatten vectors时表现如此糟糕？再一次，它不应该表现得更好吗？

我使用的代码：

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>

// Thin wrapper around flatten vector
template<typename T>
class Array3D
{
    std::size_t _X, _Y, _Z;
    std::vector<T> _vec;
public:
    Array3D(std::size_t X, std::size_t Y, std::size_t Z):
        _X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
    T& operator()(std::size_t x, std::size_t y, std::size_t z)
    {
        return _vec[z * (_X * _Y) + y * _X + x];
    }
    const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
    {
        return _vec[z * (_X * _Y) + y * _X + x];
    }
};

int main(int argc, char** argv)
{
    std::random_device rd{};
    std::mt19937 rng{rd()};
    std::uniform_real_distribution<double> urd(-1, 1);

    const std::size_t X = std::stol(argv[1]);
    const std::size_t Y = std::stol(argv[2]);
    const std::size_t Z = std::stol(argv[3]);


    // Standard library nested vector
    std::vector<std::vector<std::vector<double>>>
        vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));

    // 3D wrapper around a 1D flat vector
    Array3D<double> vec1D(X, Y, Z);

    // TIMING nested vectors
    std::cout << "Timing nested vectors...\n";
    auto start = std::chrono::steady_clock::now();
    volatile double tmp1 = 0;
    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec3D[x][y][z] = urd(rng);
                tmp1 += vec3D[x][y][z];
            }
        }
    }
    std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
    auto end = std::chrono::steady_clock::now();
    std::cout << "Took: ";
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";

    // TIMING flatten vector
    std::cout << "Timing flatten vector...\n";
    start = std::chrono::steady_clock::now();
    volatile double tmp2 = 0;
    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec1D(x, y, z) = urd(rng);
                tmp2 += vec1D(x, y, z);
            }
        }
    }
    std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
    end = std::chrono::steady_clock::now();
    std::cout << "Took: ";
    ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";
}

编辑

将 Array3D<T>::operator() return 更改为

return _vec[(x * _Y + y) * _Z + z];

as per 确实摆脱了 g++ 的 "weird" 行为，因为平面版本和嵌套版本现在花费的时间大致相同。然而，它仍然很有趣。由于缓存问题，我认为嵌套的会更糟。 我可以幸运地连续分配所有内存吗？

Answer 1

这是因为您在 3D 中对索引进行排序的方式 class。由于最内层的循环正在更改 z，这是索引的最大部分，因此您会遇到很多缓存未命中。将您的索引重新排列为

_vec[(x * _Y + y) * _Z + z]

你应该会看到更好的性能。

Answer 2

（这并没有真正回答问题。我想我最初是倒着读的，假设 OP 刚刚找到我所期待的，嵌套向量比平面向量慢。）

您应该预料到嵌套向量版本对于顺序访问以外的任何事情都比较慢。在为平面版本修复 row/column 主要索引顺序后，它在许多用途中应该更快，特别是因为编译器在大型平面数组上使用 SIMD 自动矢量化比在许多短 std::vector<>.

一个缓存行只有64B。那是 8 double 秒。由于有限的 TLB 条目，页面级别的局部性很重要，并且预取需要顺序访问，但是无论如何（足够接近）你都会得到它（足够接近）嵌套向量，这些向量在大多数 malloc 实现中一次分配。（这是一个简单的微基准测试，在分配其 vector 之前不做任何事情。在一个实际程序中，在进行大量小分配之前分配和释放一些内存，其中一些可能分散得更多。）

除了局部性之外，额外的间接级别也可能存在问题。

指向 std::vector 的引用/指针仅指向保存当前大小的固定大小块，已分配 space，以及指向缓冲区的指针。 IDK if any implementations place buffer right after the control data as part of the same malloced block，但这可能是不可能的，因为 sizeof(std::vector<int>) 必须是常量，所以你可以有一个向量向量。 Check out the asm on godbolt：仅 returns v[10] 的函数使用数组 arg 进行一次加载，但使用 std::vector arg 进行两次加载。

在嵌套向量实现中，加载 v[x][y][z] 需要 4 个步骤（假设指向 v 的指针或引用已经在寄存器中）。

加载 v.buffer_pointer 或 v.bp 或实现调用它的任何内容。（指向 std::vector<std::vector<double>> 数组的指针）
加载v.bp[x].bp（指向std::vector<double>数组的指针）
加载v.bp[x].bp[y].bp（指向double数组的指针）
加载v.bp[x].bp[y].bp[z]（我们想要的double）

一个适当的 3D 阵列，用单个 std::vector 模拟，只是：

加载v.bp（指向double数组的指针）
加载v.bp[(x*w + y)*h + z]（我们想要的double）

多次访问具有不同 x 和 y 的相同模拟 3D 数组需要计算新索引，但 v.bp 将保留在寄存器中。 所以我们只得到一个，而不是 3 个缓存未命中。

按顺序遍历 3D 数组隐藏了嵌套向量实现的代价，因为最内层向量中的所有值都有一个循环，隐藏了更改 x 和 y 的开销。预取外部向量中的连续指针在这里有所帮助，并且 Z 在您的测试中足够小，循环遍历一个最内部的向量不会驱逐下一个 y 值的指针。

What Every Programmer Should Know About Memory 有点过时了，但它涵盖了缓存和局部性的细节。软件预取不像在 P4 上那么重要，所以不要太在意指南的这一部分。

Answer 3

Reading over the other answers I'm not really satisfied with the accuracy and level of detail of the answers, so I'll attempt an explanation myself:

The man issue here is not indirection but a question of spatial locality:

There are basically two things that make caching especially effective:

Temporal locality, which means that a memory word that has been accessed recently will likely be accessed again in the near future. This might for example happen at nodes near the root of a binary search tree that is accessed frequently.
Spatial locality, which means that if a memory word has been accessed, it is likely that the memory words before or after this word will soon be accessed as well. This happens in our case, for nested and flattened arrays.

To assess the impact that indirection and cache effects might have on this problem, let's just assume that we have X = Y = Z = 1024

Judging from this question, a single cache line (L1, L2 or L3) is 64 bytes long, that means 8 double values. Let's assume the L1 cache has 32 kB (4096 doubles), the L2 cache has 256 kB (32k doubles) and the L3 cache has 8 MB (1M doubles).

This means that - assuming the cache is filled with no other data (which is a bold guess, I know) - in the flattened case only every 4th value of y leads to a L1 cache miss (the L2 cache latency is probably around 10-20 cycles), only every 32nd value of y leads to a L2 cache miss (the L3 cache latency is some value lower than 100 cycles) and only in case of a L3 cache miss we actually have to access main memory. I don't want to open up the whole calculation here, since taking the whole cache hierarchy into account makes it a bit more difficult, but let's just say that almost all accesses to memory can be cached in the flattened case.

In the original formulation of this question, the flattened index was computed differently (z * (_X * _Y) + y * _X + x), an increase of the value that changes in the innermost loop (z) always means a jump of _X * _Y * 64 bit, thus leading to a much more non-local memory layout, which increased cache faults by a large amount.

In the nested case, the answer depends a lot on the value of Z:

If Z is rather large, most of the memory accesses are contiguous, since they refer to the entries of a single vector in the nested vector<vector<vector>>>, which are laid out contiguously. Only when the y or x value is increased we need to actually use indirection to retrieve the start pointer of the next innermost vector.
If Z is rather small, we need to change our 'base pointer' memory access quite often, which might lead to cache-misses if the storage areas of the innermost vectors are placed rather randomly in memory. However, if they are more or less contiguous, we might observe minimal to no differences in cache performance.

Since there was a question about the assembly output, let me give a short overview:

If you compare the assembly output of the nested and flattened array, you notice a lot of similarities: There are three equivalent nested loops and the counting variables x, y and z are stored in registers. The only real difference - aside from the fact that the nested version uses two counters for every outer index to avoid multiplying by 24 on every address computation, and the flattened version does the same for the innermost loop and multiplying by 8 - can be found in the y loop where instead of just incrementing y and computing the flattened index, we need to do three interdependent memory loads to determine the base pointer for our inner loop:

mov rax, QWORD PTR [rdi]
mov rax, QWORD PTR [rax+r11]
mov rax, QWORD PTR [rax+r10]

But since these only happen every Zth time and the pointers for the 'middle vector' are most likely cached, the time difference is negligible.

Answer 4

为什么在固定索引顺序后，嵌套向量的速度与微基准中的 flat 大致相同：您希望平面数组更快（请参阅 , and 为什么嵌套向量通常很糟糕，但对于顺序访问来说还算不错）。但是您的特定测试做了很多事情，让乱序执行隐藏了使用嵌套向量的开销，and/or 只会减慢速度，以至于额外的开销在测量噪声中丢失。

我将 g++5.2 编译的内部循环的性能错误修复源代码 up on Godbolt so we can look at the asm 与 -O3 放在一起。（Apple 的 clang 分支可能类似于 clang3.7，但我只看 gcc 版本。）有很多来自 C++ 函数的代码，但您可以右键单击源代码行以滚动 asm windows 到该行的代码。此外，将鼠标悬停在源代码行上以加粗实现该行的 asm，反之亦然。

gcc 嵌套版本的内部两个循环如下（手工添加了一些注释）：

## outer-most loop not shown

.L213:  ## middle loop (over `y`)
    test    rbp, rbp        # Z
    je      .L127           # inner loop runs zero times if Z==0
    mov     rax, QWORD PTR [rsp+80]   # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]
    xor     r15d, r15d        # z = 0
    mov     rax, QWORD PTR [rax+r12]  # MEM[(struct vector * *)_195], MEM[(struct vector * *)_195]
    mov     rdx, QWORD PTR [rax+rbx]  # D.103857, MEM[(double * *)_38]

## Top of inner-most loop.
.L128:
    lea     rdi, [rsp+5328]   # tmp511,   ## function arg: pointer to the RNG object, which is a local on the stack.
    lea     r14, [rdx+r15*8]  # D.103851,  ## r14 = &(vec3D[x][y][z])
    call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)  #
    addsd   xmm0, xmm0    # D.103853, D.103853  ## return val *= 2.0: [0.0, 2.0]
    mov     rdx, QWORD PTR [rsp+80]   # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]   ## redo the pointer-chasing from vec3D.data()
    mov     rdx, QWORD PTR [rdx+r12]  # MEM[(struct vector * *)_150], MEM[(struct vector * *)_150]
    subsd   xmm0, QWORD PTR .LC6[rip]     # D.103859, ## and subtract 1.0:  [-1.0, 1.0]
    mov     rdx, QWORD PTR [rdx+rbx]  # D.103857, MEM[(double * *)_27]
    movsd   QWORD PTR [r14], xmm0 # *_155, D.103859        # store into vec3D[x][y][z]
    movsd   xmm0, QWORD PTR [rsp+64]      # D.103853, tmp1  # reload volatile tmp1
    addsd   xmm0, QWORD PTR [rdx+r15*8]   # D.103853, *_62  # add the value just stored into the array (r14 = rdx+r15*8 because nothing else modifies the pointers in the outer vectors)
    add     r15, 1    # z,
    cmp     rbp, r15  # Z, z
    movsd   QWORD PTR [rsp+64], xmm0      # tmp1, D.103853  # spill tmp1
    jne     .L128     #,
 #End of inner-most loop

.L127:  ## middle-loop
    add     r13, 1    # y,
    add     rbx, 24           # sizeof(std::vector<> == 24) == the size of 3 pointers.
    cmp     QWORD PTR [rsp+8], r13    # %sfp, y
    jne     .L213     #,

 ## outer loop not shown.

对于平面循环：

 ## outer not shown.
.L214:
    test    rbp, rbp        # Z
    je      .L135       #,
    mov     rax, QWORD PTR [rsp+280]  # D.103849, vec1D._Y
    mov     rdi, QWORD PTR [rsp+288]  # D.103849, vec1D._Z
    xor     r15d, r15d        # z
    mov     rsi, QWORD PTR [rsp+296]  # D.103857, MEM[(double * *)&vec1D + 24B]

.L136:  ## inner-most loop
    imul    rax, r12        # D.103849, x
    lea     rax, [rax+rbx]    # D.103849,
    imul    rax, rdi        # D.103849, D.103849
    lea     rdi, [rsp+5328]   # tmp520,
    add     rax, r15  # D.103849, z
    lea     r14, [rsi+rax*8]  # D.103851,       # &vec1D(x,y,z)
    call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)  #
    mov     rax, QWORD PTR [rsp+280]  # D.103849, vec1D._Y
    addsd   xmm0, xmm0    # D.103853, D.103853
    mov     rdi, QWORD PTR [rsp+288]  # D.103849, vec1D._Z
    mov     rsi, QWORD PTR [rsp+296]  # D.103857, MEM[(double * *)&vec1D + 24B]
    mov     rdx, rax  # D.103849, D.103849
    imul    rdx, r12        # D.103849, x       # redo address calculation a 2nd time per iteration
    subsd   xmm0, QWORD PTR .LC6[rip]     # D.103859,
    add     rdx, rbx  # D.103849, y
    imul    rdx, rdi        # D.103849, D.103849
    movsd   QWORD PTR [r14], xmm0 # MEM[(double &)_181], D.103859  # store into the address calculated earlier
    movsd   xmm0, QWORD PTR [rsp+72]      # D.103853, tmp2
    add     rdx, r15  # tmp374, z
    add     r15, 1    # z,
    addsd   xmm0, QWORD PTR [rsi+rdx*8]   # D.103853, MEM[(double &)_170]   # tmp2 += vec1D(x,y,z).  rsi+rdx*8 == r14, so this is a reload of the store this iteration.
    cmp     rbp, r15  # Z, z
    movsd   QWORD PTR [rsp+72], xmm0      # tmp2, D.103853
    jne     .L136     #,

.L135:  ## middle loop: increment y
    add     rbx, 1    # y,
    cmp     r13, rbx  # Y, y
    jne     .L214     #,

 ## outer loop not shown.

您的 MacBook Pro（2012 年末）有 Intel IvyBridge CPU, so I'm using numbers for that microarchitecture from Agner Fog's instruction tables and microarch guide。其他 Intel/AMD CPU 上的情况应该大致相同。

唯一的 2.5GHz 移动 IvB i5 是 i5-3210M，因此您的 CPU 具有 3MiB 的三级缓存。这意味着即使是最小的测试用例（100^3 * 8B / double ~= 7.63MiB）也大于最后一级缓存，因此 none 的测试用例完全适合缓存。这可能是一件好事，因为在测试它们中的任何一个之前，您都分配并默认初始化了 nested 和 flat。但是，您确实按照分配的相同顺序进行测试，因此如果嵌套数组在将平面数组归零后仍在缓存中，则平面数组在嵌套数组上的计时循环后在 L3 缓存中可能仍然很热。

如果您使用重复循环多次遍历同一个数组，您可以获得足够大的时间来测量较小的数组大小。

你在这里做了几件非常奇怪的事情，并使它变得如此缓慢，以至于乱序执行可以隐藏更改 y 的额外延迟，即使你的内心 z 向量不是完全连续的。

你运行定时循环内的慢 PRNG。 std::uniform_real_distribution<double> urd(-1, 1); 是 std::mt19937 rng{rd()}; 之上的额外开销，与 FP-add 延迟（3 个周期）或与每个周期 2 个 L1D 缓存加载吞吐量相比，这已经很慢了。所有这些额外的时间运行对 PRNG 进行乱序执行运行数组索引指令，以便最终地址在数据准备好时准备就绪。 除非您有很多缓存未命中，否则您主要只是测量 PRNG 速度，因为它产生的结果比每个时钟周期 1 个慢得多。

g++5.2 没有完全内联 urd(rng) 代码，x86-64 System V 调用约定没有调用保留的 XMM 寄存器。所以每个元素的 tmp1/tmp2 必须是 spilled/reloaded，即使它们不是 volatile.

它也失去了它在 Z 向量中的位置，并且在访问下一个 z 元素之前必须重做外部 2 层间接寻址。这是因为它不知道其调用函数的内部结构，并假设它可能有一个指向外部 vector<> 内存的指针。（平面版本在内部循环中进行两次索引乘法，而不是简单的指针加法。）

clang（使用 libc++）完全内联 PRNG，因此移动到下一个 z 只是 add reg, 8 以增加平面和嵌套版本中的指针。您可以通过在内部循环外部获取迭代器或获取对内部向量的引用来从 gcc 获得相同的行为，而不是重做 operator[] 并希望编译器为您提升它。

Intel/AMD FP add/sub/mul throughput/latency 不依赖于数据，非规范化除外。（，但 SSE 没有。64 位代码甚至对标量 float/double 使用 SSE。）所以你可以用零或 PRNG 初始化你的数组outisde 定时循环。（或者让它们归零，因为 vector<double> 构造函数会为你做这件事，而且它实际上需要额外的代码才能得到它 而不是 在你要写东西的情况下否则。）除法和 sqrt 性能取决于某些 CPU 的数据，并且比 add/sub/mul.
你在阅读之前在内部循环中写入每个元素。在源代码中，这看起来像 store/reload。不幸的是，这就是 gcc 实际上所做的，但是 libc++（内联 PRNG）的 clang 转换了循环体：
```
       // original
       vec3D[x][y][z] = urd(rng);
       tmp1 += vec3D[x][y][z];

       // what clang's asm really does
       double xmm7 = urd(rng);  
       vec3D[x][y][z] = xmm7;
       tmp1 += xmm7;
```
在 clang 的 asm 中：
```
                               # do { ...

addsd   xmm7, xmm4             # last instruction of the PRNG
movsd   qword ptr [r8], xmm7   # store it into the Z vector
addsd   xmm7, qword ptr [rsp + 88]
add     r8, 8                  # pointer-increment to walk along the Z vector
dec     r13                    # i--
movsd   qword ptr [rsp + 88], xmm7
jne     .LBB0_74               # }while(i != 0);
```
允许这样做是因为 vec3D 不是 volatile 或 atomic<>，所以任何其他线程同时写入此内存都是未定义的行为.这意味着它可以将 store/reloads 优化到内存中的对象到一个存储中（并且只需使用它存储的值，而无需重新加载）。或者如果它能证明它是一个死存储（一个没有任何东西可以读取的存储，例如一个未使用的 static 变量），则完全优化存储。

在gcc的版本中，它在调用PRNG之前为store做索引，之后为reload做索引。所以我认为 gcc 不确定函数调用不会修改指针，因为指向外部向量的指针已经逃脱了函数。（并且 PRNG 不内联）。

然而，即使是 asm 中的实际 store/reload 仍然比简单加载对缓存未命中更不敏感！

存储->负载转发仍然有效，即使存储未命中缓存。因此 Z 向量中的缓存未命中不会直接延迟关键路径。如果乱序执行无法隐藏缓存未命中的延迟，它只会减慢您的速度。（一旦数据写入存储缓冲区（并且所有先前的指令都已退出），存储就可以退出）。我不确定负载是否可以在缓存行到达 L1D 之前退出，如果它得到它的数据来自存储转发。它可能能够，因为 x86 确实允许 StoreLoad 重新排序（允许存储在加载后变得全局可见）。在这种情况下，store/reload 只会为 PRNG 添加 6 个延迟周期结果（脱离了从一个 PRNG 状态到下一个 PRNG 状态的关键路径）。如果它缓存未命中太多以至于存储缓冲区填满并阻止执行新的存储微指令，这最终会阻止新的存储微指令，那么它只是吞吐量的瓶颈当保留站或 ROB 填满未执行或未退役（分别）的微指令时，微指令从发布到无序核心中。

使用反向索引（平面代码的原始版本），主要瓶颈可能是分散的商店。 IDK 为什么 clang 在那里比 gcc 做得好得多。毕竟，也许 clang 设法反转循环并按顺序遍历内存。（因为它完全内联了 PRNG，所以没有函数调用需要内存状态来匹配程序顺序。）

按顺序遍历每个 Z 向量意味着缓存未命中相对较远（即使每个 Z 向量与前一个不连续），为存储执行提供大量时间。或者即使在 L1D 缓存实际拥有缓存行（在 MESI 协议的修改状态下）之前存储转发的加载实际上不能退出，推测执行具有正确的数据并且不必等待延迟缓存未命中。无序指令 window 可能足够大，可以防止关键路径在负载退出之前停止。（缓存未命中负载通常非常糟糕，因为如果没有数据供其操作，则无法执行相关指令。因此它们更容易在管道中产生气泡。DRAM 的完整缓存未命中延迟超过 300周期，并且无序 window 在 IvB 上为 168 微指令，它无法隐藏每个时钟甚至 1 微指令（大约 1 条指令）执行代码的所有延迟。）对于纯存储，乱序 window 超出了 ROB 大小，因为他们不需要承诺 L1D 就可以退休。事实上，他们不能直到他们退休后才做出承诺，因为那时候他们是众所周知的非投机者。（因此，让它们早于全局可见可以防止在检测到异常或错误推测时回滚。）

我的桌面上没有安装 libc++，所以我无法根据 g++ 对该版本进行基准测试。使用 g++5.4，我发现嵌套：225 毫秒和平面：239 毫秒。我怀疑额外的数组索引乘法是一个问题，并且与 PRNG 使用的 ALU 指令竞争。相比之下，嵌套版本重做一堆在 L1D 缓存中命中的指针追逐可以并行发生。我的桌面是 4.4GHz 的 Skylake i7-6700k。 SKL 的 ROB（重新排序缓冲区）大小为 224 微指令，RS 为 97 微指令，so the out-of-order window is very large。它还具有 4 个周期的 FP-add 延迟（与之前的 3 个周期不同）。
volatile double tmp1 = 0; 你的累加器是volatile，它强制编译器在每次内循环迭代时store/reload它。 内循环中循环承载依赖链的总延迟为 9 个周期：3 个用于 addsd，6 个用于从 movsd 存储到 movsd 重新加载的存储转发. （clang 使用 addsd xmm7, qword ptr [rsp + 88] 将重新加载折叠到内存操作数中，但区别相同。（[rsp+88] 在堆栈上，存储具有自动存储的变量，如果它们需要从寄存器中溢出。）

如上所述，gcc 的非内联函数调用还将在 x86-64 System V 调用约定中强制使用 spill/reload（除了 Windows 之外的所有内容都使用）。但是一个聪明的编译器可以完成 4 个 PRNG 调用，例如，然后是 4 个数组存储。（如果您使用迭代器来确保 gcc 知道包含其他向量的向量没有发生变化。）

使用 -ffast-math 会让编译器自动向量化（如果不是 PRNG 和 volatile）。这将使您运行足够快地遍历数组，以至于不同 Z 向量之间缺乏局部性可能是一个真正的问题。它还可以让编译器展开多个累加器，以隐藏 FP-add 延迟。例如他们可以（而且 clang 会）使 asm 等同于：
```
float t0=0, t1=0, t2=0, t3=0;
for () {
   t0 += a[i + 0];
   t1 += a[i + 1];
   t2 += a[i + 2];
   t3 += a[i + 3];
}
t0 = (t0 + t1) + (t2 + t3);
```
它有 4 个独立的依赖链，所以它可以保持 4 个 FP 添加在飞行中。由于 IvB 有 3 个周期的延迟，addsd 的每个时钟吞吐量一个，我们只需要保持 4 个处于飞行状态即可使其吞吐量饱和。（Skylake 有 4c 延迟，每个时钟吞吐量 2，与 mul 或 FMA 相同，因此您需要 8 个累加器以避免延迟瓶颈。实际上，。正如该问题的提问者所测试的那样，Haswell 甚至做得更好当接近最大负载吞吐量时更多的累加器。）

类似的东西可以更好地测试循环 Array3D 的效率。 如果您想完全停止循环优化，只需使用结果。测试你的微基准测试以确保增加问题的规模会增加时间；如果不是，那么某些东西被优化掉了，或者你没有测试你认为你正在测试的东西。不要使内部循环临时 volatile!!

编写微基准测试并不容易。您必须足够了解才能编写一个测试您认为正在测试的内容的程序。 :P 这是一个很容易出错的好例子。

May I just be lucky and have all the memory contiguously allocated?

是的，当您在执行此操作之前没有分配和释放任何东西时，许多按顺序完成的小分配可能会发生这种情况。如果它们足够大（通常是一个 4kiB 页面或更大），glibc malloc 将切换为使用 mmap(MAP_ANONYMOUS)，然后内核将选择随机虚拟地址 (ASLR)。因此，对于较大的 Z，您可能会期望局部性变得更糟。但另一方面，较大的 Z 向量意味着您会花费更多的时间遍历一个连续的向量，因此在更改 y（和 x）时缓存未命中变得相对不那么重要。

用你的数据顺序循环显然不会暴露这一点，因为额外的指针访问命中缓存，所以指针追逐具有足够低的 OOO 执行延迟，可以用你的慢循环隐藏它。

Prefetch 很容易保持在这里。

不同的编译器/库会对这个奇怪的测试产生很大的影响。在我的系统上（Arch Linux，i7-6700k Skylake，最大睿频为 4.4GHz），g++5.4 -O3 在 300 300 300 的 4 运行s 中最好的是：

Timing nested vectors...
        Sum: 579.78
Took: 225 milliseconds
Timing flatten vector...
        Sum: 579.78
Took: 239 milliseconds

 Performance counter stats for './array3D-gcc54 300 300 300':

    532.066374      task-clock (msec)         #    1.000 CPUs utilized          
             2      context-switches          #    0.004 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
        54,523      page-faults               #    0.102 M/sec                  
 2,330,334,633      cycles                    #    4.380 GHz                    
 7,162,855,480      instructions              #    3.07  insn per cycle         
   632,509,527      branches                  # 1188.779 M/sec                  
       756,486      branch-misses             #    0.12% of all branches        

   0.532233632 seconds time elapsed

对比g++7.1 -O3（显然决定分支 g++5.4 没有的东西）

Timing nested vectors...
        Sum: 932.159
Took: 363 milliseconds
Timing flatten vector...
        Sum: 932.159
Took: 378 milliseconds

 Performance counter stats for './array3D-gcc71 300 300 300':

    810.911200      task-clock (msec)         #    1.000 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
        54,523      page-faults               #    0.067 M/sec                  
 3,546,467,563      cycles                    #    4.373 GHz                    
 7,107,511,057      instructions              #    2.00  insn per cycle         
   794,124,850      branches                  #  979.299 M/sec                  
    55,074,134      branch-misses             #    6.94% of all branches        

   0.811067686 seconds time elapsed

对比clang4.0 -O3（使用 gcc 的 libstdc++，而不是 libc++）

perf stat ./array3D-clang40-libstdc++ 300 300 300
Timing nested vectors...
        Sum: -349.786
Took: 1657 milliseconds
Timing flatten vector...
        Sum: -349.786
Took: 1631 milliseconds

 Performance counter stats for './array3D-clang40-libstdc++ 300 300 300':

   3358.297093      task-clock (msec)         #    1.000 CPUs utilized          
             9      context-switches          #    0.003 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
        54,521      page-faults               #    0.016 M/sec                  
14,679,919,916      cycles                    #    4.371 GHz                    
12,917,363,173      instructions              #    0.88  insn per cycle         
 1,658,618,144      branches                  #  493.887 M/sec                  
       916,195      branch-misses             #    0.06% of all branches        

   3.358518335 seconds time elapsed

我没有深究clang做错了什么，或者用-ffast-mathand/or-march=native试试。（不过，除非您删除 volatile，否则它们不会有太大作用。）

perf stat -d 显示 clang 的缓存未命中数（L1 或最后一级）并不比 gcc 多。但它确实表明 clang 的 L1D 负载是原来的两倍多。

我确实尝试过使用非方阵。保持总元素数相同但将最终维度更改为 5 或 6 几乎完全相同。

即使是对 C 语言的微小改动也会有所帮助，并使 "flatten" 比嵌套 gcc 更快（对于 300^3，从 240 毫秒减少到 220 毫秒，但对嵌套几乎没有任何影响。）：

 // vec1D(x, y, z) = urd(rng);
    double res = urd(rng);
    vec1D(x, y, z) = res;   // indexing calculation only done once, after the function call
    tmp2 += vec1D(x, y, z);
    // using iterators would still avoid redoing it at all.

Answer 5

May I just be lucky and have all the memory contiguously allocated?

可能是的。我对您的示例做了一些修改，因此我们有一个基准，它更侧重于两种方法之间的差异：

数组填充是在单独的通道中完成的，因此不包括随机生成器速度
移除了易失性
修复了一个小错误（打印了 tmp1 而不是 tmp2）
添加了 #if 1 部分，可用于随机化内存中的 vec3D 位置。事实证明，它在我的机器上有很大的不同。

没有随机化（我使用的参数：300 300 300）：

Timing nested vectors...
    Sum: -131835
Took: 2122 milliseconds
Timing flatten vector...
    Sum: -131835
Took: 2085 milliseconds

所以还是有点区别的，flatten 版本快一点。（我已经运行进行了多次测试，并将最少的时间放在这里）。

随机化：

Timing nested vectors...
    Sum: -117685
Took: 3014 milliseconds
Timing flatten vector...
    Sum: -117685
Took: 2085 milliseconds

所以这里可以看到缓存效果：嵌套版本慢了 ~50%。这是因为CPU无法预测哪个内存区域将被使用，所以它的缓存预取器效率不高。

修改后的代码如下：

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>

template<typename T>
class Array3D
{
    std::size_t _X, _Y, _Z;
    std::vector<T> _vec;
    public:
    Array3D(std::size_t X, std::size_t Y, std::size_t Z):
        _X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
    T& operator()(std::size_t x, std::size_t y, std::size_t z)
    {
        return _vec[(x * _Y + y) * _Z + z];

    }
    const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
    {
        return _vec[(x * _Y + y) * _Z + z];
    }
};

double nested(std::vector<std::vector<std::vector<double>>> &vec3D, std::size_t X, std::size_t Y, std::size_t Z) {
    double tmp1 = 0;
    for (int iter=0; iter<100; iter++)
        for (std::size_t x = 0 ; x < X; ++x)
        {
            for (std::size_t y = 0 ; y < Y; ++y)
            {
                for (std::size_t z = 0 ; z < Z; ++z)
                {
                    tmp1 += vec3D[x][y][z];
                }
            }
        }
    return tmp1;
}

double flatten(Array3D<double> &vec1D, std::size_t X, std::size_t Y, std::size_t Z) {
    double tmp2 = 0;
    for (int iter=0; iter<100; iter++)
        for (std::size_t x = 0 ; x < X; ++x)
        {
            for (std::size_t y = 0 ; y < Y; ++y)
            {
                for (std::size_t z = 0 ; z < Z; ++z)
                {
                    tmp2 += vec1D(x, y, z);
                }
            }
        }
    return tmp2;
}

int main(int argc, char** argv)
{
    std::random_device rd{};
    std::mt19937 rng{rd()};
    std::uniform_real_distribution<double> urd(-1, 1);

    const std::size_t X = std::stol(argv[1]);
    const std::size_t Y = std::stol(argv[2]);
    const std::size_t Z = std::stol(argv[3]);

    std::vector<std::vector<std::vector<double>>>
        vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));

#if 1
    for (std::size_t i = 0 ; i < X*Y; i++)
    {
        std::size_t xa = rand()%X;
        std::size_t ya = rand()%Y;
        std::size_t xb = rand()%X;
        std::size_t yb = rand()%Y;
        std::swap(vec3D[xa][ya], vec3D[xb][yb]);
    }
#endif

    // 3D wrapper around a 1D flat vector
    Array3D<double> vec1D(X, Y, Z);

    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec3D[x][y][z] = vec1D(x, y, z) = urd(rng);
            }
        }
    }

    std::cout << "Timing nested vectors...\n";
    auto start = std::chrono::steady_clock::now();
    double tmp1 = nested(vec3D, X, Y, Z);
    auto end = std::chrono::steady_clock::now();
    std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
    std::cout << "Took: ";
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";

    std::cout << "Timing flatten vector...\n";
    start = std::chrono::steady_clock::now();
    double tmp2 = flatten(vec1D, X, Y, Z);
    end = std::chrono::steady_clock::now();
    std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
    std::cout << "Took: ";
    ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";
}

使用嵌套向量与展平向量包装器，奇怪的行为

Using nested vectors vs a flatten vector wrapper, strange behaviour

c++

performance

caching

vector

performance-testing