Intel x86: why do aligned and unaligned accesses have the same performance?

According to the Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1), "misaligned data accesses can seriously impact the performance of the processor". I wrote a test to demonstrate this, but the result is that aligned and unaligned data accesses have the same performance. Why? Can anyone help? My code is shown below:

#include <iostream>
#include <stdint.h>
#include <stdlib.h> // for atoi
#include <time.h>
#include <chrono>
#include <string.h>
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}
int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j+= 8) { // align:offset = 0 nonalign: offset=1-7
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            //mov rax,QWORD PTR [rbx+rdx*1] // rbx+rdx*1 = 0x7fffc76fe019 
            //mov QWORD PTR [rsp+0x8],rax 
            ++tmp;
            //mov rcx,QWORD PTR [rsp+0x8] 
            //add rcx,0x1 
            //mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            //mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    return 0;
}

Results:

offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns

On most modern x86 cores, aligned and unaligned performance is the same only if the access does not cross a specific internal boundary.

The exact size of that internal boundary varies with the core architecture of the CPU involved, but on Intel CPUs from the last decade the relevant boundary is the 64-byte cache line. That is, accesses that fall entirely within a 64-byte cache line perform the same regardless of whether they are aligned.

However, if a (necessarily unaligned) access crosses a cache-line boundary on an Intel chip, it pays a penalty of roughly 2x in latency and throughput. The bottom-line impact of this penalty depends on the surrounding code; it is often much less than 2x and sometimes close to zero. The modest penalty can become much larger if the access also crosses a 4K page boundary.

Aligned accesses never cross these boundaries, so they never suffer this penalty.

The overall picture is similar on AMD chips, although the relevant boundary has been smaller than 64 bytes on some recent chips, and the boundary differs for loads and stores.

I include more details in the load throughput and store throughput sections of a blog post I wrote.

Testing

Your test was unable to show the effect for several reasons:

  • The test never allocated aligned memory; you cannot reliably cross a cache line by applying an offset to a region of unknown alignment.
  • You iterated 8 bytes at a time, so the majority of the writes (7 out of 8) fall entirely within a cache line and pay no penalty, leaving only a small signal that would be detectable only if your benchmark were very clean.
  • You used a large buffer size, which does not fit in any level of the cache. The split-line effect is only really obvious at L1, or when splits mean you bring in twice the number of lines (e.g., random access). Since you access every line linearly in either case, you are limited by DRAM-to-core throughput regardless of splits: the split writes have plenty of time to complete while waiting on main memory.
  • You used a local volatile auto tmp plus tmp++, which creates a volatile on the stack and a lot of loads and stores to preserve volatile semantics: these are all aligned and wash out the effect you were trying to measure with your test.

Here is my modification of your test, which runs only in an L1-sized region and advances 64 bytes at a time, so that every store will be split, if any are:

#include <iostream>
#include <stdint.h>
#include <stdlib.h> // for atoi, rand
#include <time.h>
#include <chrono>
#include <string.h>
#include <iomanip>

using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000;
    alignas(64) uint8_t data_ptr[BUFFER_SIZE];
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) { // offset (0-63) sets the position of every store within its 64-byte line
            memcpy(data_ptr + j, &src, 8);
        }
    }
    auto end = get_time_ns();
    cout << "time elapsed " << std::setprecision(2) << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64) <<
        "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}

Running this for all offsets between 0 and 64, I get:

$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
 0 :time elapsed 0.56ns per write (rand:0)
 1 :time elapsed 0.57ns per write (rand:0)
 2 :time elapsed 0.57ns per write (rand:0)
 3 :time elapsed 0.56ns per write (rand:0)
 4 :time elapsed 0.56ns per write (rand:0)
 5 :time elapsed 0.56ns per write (rand:0)
 6 :time elapsed 0.57ns per write (rand:0)
 7 :time elapsed 0.56ns per write (rand:0)
 8 :time elapsed 0.57ns per write (rand:0)
 9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)

Note that offsets 57 through 63 all take roughly 2x longer per write, and for an 8-byte write those are exactly the offsets that cross a 64-byte (cache-line) boundary.