当内部循环迭代次数较少时，性能会有显着差异吗？

Question

我写了一个玩具程序来比较两个非常相似的函数的性能。整个文件（减去几个宏）如下所示：

constexpr int iterations = 100000;

using u64 = uint64_t;

// Slow.
void test1()
{
    u64 sum = 0;

    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < 4; j++)
            sum += i + j;
        doNotOptimize(sum);
    }
}

// Fast.
void test2()
{
    u64 sum = 0;

    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < 10; j++)
            sum += i + j;
        doNotOptimize(sum);
    }
}

void driver()
{
    int runs = 1000;
    BENCH(runs, test1);
    BENCH(runs, test2);
}

我正在使用 __rdtsc 测量每个函数的 1000 次执行并计算平均值。使用此公式，我看到 test1 和 test2 之间的性能差异约为 172,000（滴答？）。令人惊讶的是 test2 是更快的那个。

一个奇特的小细节是 test1 较慢的唯一幻数是 4、8 和 16。如果我将内部循环的条件更改为 j < x，其中 x 不是这 3 个数字，性能匹配。

在汇编中，我观察到两个函数中的内部循环都被消除了，取而代之的是作为 lea 的操作数执行的一些算术运算。所以在这种情况下，如果两个函数都同样快，那将是有意义的。但这根本不是正在发生的事情。这是完整的反汇编和程序源代码：https://godbolt.org/z/d5PsG4YeY

所以到底发生了什么？我的测量有问题吗？

执行环境：

Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Kaby Lake)
L1 Cache: 128Kib
OS: Linux (64-bit)
Toolchain: GCC Version 10.3.0
Compiler Options: -O3 -fno-tree-vectorize

Answer 1

4 和 8 是 x86 寻址模式支持的比例因子，诱使 GCC 在添加 4*i 时在关键路径依赖链上使用 slow-LEA 或 8*i 到 sum 以及 1..4 或 1..8 的常数和（无论哪种方式都只是一个常数，无关紧要）。显然也是 16 的 dep 链的一部分。并且您使用内联 asm 强制 sum dep 链包含 store/reload.

Analyzing the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations done as operands of lea. So in this case, it would make sense if both functions ran at the same speed.

它们都快，但是 i 的不同倍数采用不同的指令。因此，没有理由假设它们会相同。这里有趣的是指令总数越多的指令越快，因为它通过 sum.

的依赖链更短

并且您用 somewhat-clunky asm volatile("" :: "g"(&sum) : "memory") 强制了它的 store/reload，而不是仅仅要求编译器将值保存在 asm volatile("" : "+r"(sum)) 的寄存器中。因此，该 dep 链包括 store-forwarding 延迟（通常为 3 到 5 个周期），因此它是比 front-end 或独立工作的 ALU 吞吐量更大的瓶颈。

test1():
        xor     eax, eax                   # i = 0
        xor     ecx, ecx                   # sum = 0
        lea     rsi, [rsp-8]                 # operand for inline asm
        jmp     .L3
.L5:
        mov     rcx, QWORD PTR [rsp-8]     # load sum
        mov     rax, rdx                   # i = i+1
.L3:
        lea     rdx, [rax+1]               # i+1, leaving the old value in RAX
        lea     rax, [rcx+6+rax*4]         # sum += 4*i + 6.  (1+2+3)
        mov     QWORD PTR [rsp-8], rax     # store sum
        cmp     rdx, 100000
        jne     .L5
        ret

在寻址模式下具有 3 个组件（两个 + 操作）的 LEA 在您的 Skylake-derived CPU 上有 3 个周期延迟。参见

所以 loop-carried 依赖链是 slow-LEA（3 个周期）在 load/store 之间。

test2():
        xor     eax, eax
        xor     ecx, ecx
        lea     rdi, [rsp-8]                  # operand for inline asm
        jmp     .L8
.L9:
        mov     rcx, QWORD PTR [rsp-8]      # load sum
        mov     rax, rdx
.L8:
        lea     rsi, [rax+36+rax*8]
        lea     rdx, [rax+1]
        lea     rax, [rax+9+rsi]               # prepare some stuff to be added
        add     rax, rcx                    # first use of the RCX load result (sum)
        mov     QWORD PTR [rsp-8], rax      # store sum
        cmp     rdx, 100000
        jne     .L9
        ret

因此通过 store/reload 的 loop-carried dep 链仅包含一个 add，1 个周期延迟而不是 3 个。

我假设你的性能比大约是 3:4 左右，从 5+1 个周期 (6) 到 5+3 (8) 个周期。

有关详细信息，请参阅 What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?。

编译器可以在 test1 中花费更多指令来将关键路径延迟减少到 1 个周期，而不是将更多工作折叠到 lea 在关键路径上。对于循环运行 100k 次迭代，这几乎是一个错过的优化。但是我假设优化器没有针对内联 asm 中的 artificially-induced spill/reload 进行调整；通常它只需要引入 store/reload 如果它运行在循环中做很多事情而超出寄存器，并且许多不同的值通常意味着有一些 instruction-level 并行性.

GCC 在 Quick-bench

上从更简单的源代码中获得更好的 asm

@TedLyngmo linked the code on Quick-bench 没有 -fno-tree-vectorize，并使用 benchmark::DoNotOptimize(sum); 这只会强制 GCC 实现寄存器中的值，而不会阻止 constant-propagation 通过它，或者作为许多其他优化。不像获取它的地址并告诉 GCC 这可能是全局可见的，比如当前的自定义 asm。

内部循环体只是 add %rdx,%rax / add [=32=]x4,%rdx（并且 cmp rdx + jne 作为循环 b运行ch），如果您查看 Quickbench 上的 asm。或者在另一个循环中使用 rdx+=10。同样的循环，不同的常量。当然，性能相同。

当前源代码基本上编译成 asm

for (int i = 0 ; i<iterations ; i++){
    sum += 8*i + 1234;
    force_store_reload(&sum);
}

但如果你真的那样写 (https://godbolt.org/z/4ja38or9j)，我们得到的 asm 就像 quick-bench，除了将值保存在内存中而不是寄存器中。（所以慢了大约 6 倍。）

.L6:
        add     QWORD PTR [rsp-8], rax   # memory destination add
        add     rax, 4
        cmp     rax, 401234
        jne     .L6

这似乎是一个 missed-optimization 错误，GCC 没有将您现有的源代码编译成那个。具体来说，从 8*i re-evaluated 中缺少 strength-reduction，每个 i 到 tmp += 8。

顺便说一句，似乎省略 -fno-tree-vectorize 会使您原来的 test1() 编译更糟。它开始时没有跳到循环的中间，但它有更长的 dep 链。

#gcc 10.3 -O3        (without -fno-tree-vectorize)
test1_orig():
        mov     QWORD PTR [rsp-8], 6      # init sum
        lea     rsi, [rsp-8]              # operand for inline asm
        mov     eax, 1
.L2:
        mov     rdx, QWORD PTR [rsp-8]   # load sum
        mov     rcx, rax
        add     rdx, rax                 # used: 1 cycle latency
        add     rax, 1
        add     rdx, rax                 # used: 1 cycle latency
        lea     rdx, [rdx+5+rcx*2]       # used: 3 cycle latency
        mov     QWORD PTR [rsp-8], rdx   # store
        cmp     rax, 100000
        jne     .L2
        ret

当内部循环迭代次数较少时，性能会有显着差异吗？

Performance differ significantly when inner loop iterates less times?

c++

assembly

gcc

x86-64

cpu-architecture

GCC 在 Quick-bench