为什么 clang 使 Quake 快速反平方根代码比 GCC 快 10 倍？（带有 (long)float 类型的双关语）

Question

我正在尝试对 fast inverse square root 进行基准测试。完整代码在这里：

#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;
    
static void BM_FastInverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        // from wikipedia:
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;
        i  = 0x5f3759df - ( i >> 1 );
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );
        //  y  = y * ( threehalfs - ( x2 * y * y ) );
        
        float result = y;
        benchmark::DoNotOptimize(result);
    }
}


static void BM_InverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        float result = 1 / sqrt(number);
        benchmark::DoNotOptimize(result);
    } 
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);

和here是quick-bench中的代码，如果你想自己运行的话。

使用 GCC 11.2 和 -O3 编译，BM_FastInverseSqrRoot 比 Noop 慢 31 倍左右（当我在我的机器上本地运行时大约 10 ns）。使用 Clang 13.0 和 -O3 编译，它比 Noop 慢 3.6 倍左右（当我在我的机器上本地运行时大约 1 ns）。这是 10 倍的速度差异。

这是相关的程序集（取自 quick-bench）。

使用 GCC：

               push   %rbp
               mov    %rdi,%rbp
               push   %rbx
               sub    [=11=]x18,%rsp
               cmpb   [=11=]x0,0x1a(%rdi)
               je     408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
               callq  40a770 <benchmark::State::StartKeepRunning()>
  408c84       add    [=11=]x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
               jmpq   40aa20 <benchmark::State::FinishKeepRunning()>
               nopw   0x0(%rax,%rax,1)
  408c98       mov    0x10(%rdi),%rbx
               callq  40a770 <benchmark::State::StartKeepRunning()>
               test   %rbx,%rbx
               je     408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
               movss  0x1b386(%rip),%xmm4        # 424034 <_IO_stdin_used+0x34>
               movss  0x1b382(%rip),%xmm3        # 424038 <_IO_stdin_used+0x38>
               mov    [=11=]x5f3759df,%edx
               nopl   0x0(%rax,%rax,1)
   408cc0      movss  0x237a8(%rip),%xmm0        # 42c470 <number>
               mov    %edx,%ecx
               movaps %xmm3,%xmm1
        2.91%  movss  %xmm0,0xc(%rsp)
               mulss  %xmm4,%xmm0
               mov    0xc(%rsp),%rax
        44.70% sar    %rax
        3.27%  sub    %eax,%ecx
        3.24%  movd   %ecx,%xmm2
        3.27%  mulss  %xmm2,%xmm0
        9.58%  mulss  %xmm2,%xmm0
        10.00% subss  %xmm0,%xmm1
        10.03% mulss  %xmm2,%xmm1
        9.64%  movss  %xmm1,0x8(%rsp)
        3.33%  sub    [=11=]x1,%rbx
               jne    408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
               add    [=11=]x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
  408d0a       jmpq   40aa20 <benchmark::State::FinishKeepRunning()>

使用 Clang：

           push   %rbp
           push   %r14
           push   %rbx
           sub    [=12=]x10,%rsp
           mov    %rdi,%r14
           mov    0x1a(%rdi),%bpl
           mov    0x10(%rdi),%rbx
           call   213a80 <benchmark::State::StartKeepRunning()>
           test   %bpl,%bpl
           jne    212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
           test   %rbx,%rbx
           je     212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
           movss  -0xf12e(%rip),%xmm0        # 203cec <_IO_stdin_used+0x8>
           movss  -0xf13a(%rip),%xmm1        # 203ce8 <_IO_stdin_used+0x4>
           cs nopw 0x0(%rax,%rax,1)
           nopl   0x0(%rax)
 212e30 2.46%  movd   0x3c308(%rip),%xmm2        # 24f140 <number>
        4.83%  movd   %xmm2,%eax
        8.07%  mulss  %xmm0,%xmm2
        12.35% shr    %eax
        2.60%  mov    [=12=]x5f3759df,%ecx
        5.15%  sub    %eax,%ecx
        8.02%  movd   %ecx,%xmm3
        11.53% mulss  %xmm3,%xmm2
        3.16%  mulss  %xmm3,%xmm2
        5.71%  addss  %xmm1,%xmm2
        8.19%  mulss  %xmm3,%xmm2
        16.44% movss  %xmm2,0xc(%rsp)
        11.50% add    [=12=]xffffffffffffffff,%rbx
               jne    212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
 212e69        mov    %r14,%rdi
               call   213af0 <benchmark::State::FinishKeepRunning()>
               add    [=12=]x10,%rsp
               pop    %rbx
               pop    %r14
               pop    %rbp
  212e79       ret

他们看起来和我很像。两者似乎都在像 mulss 一样使用 SIMD registers/instructions。 GCC 版本有一个 sar 应该占 46%？（但我认为它只是贴错了标签，mulss, mov, sar 加起来占了 46%）。无论如何，我对 Assembly 还不够熟悉，无法真正判断是什么导致了如此巨大的性能差异。

有人知道吗？

Answer 1

仅供参考， - 不，已被 SSE1 废弃 rsqrtss，您可以使用或不使用牛顿迭代。

正如人们在评论中指出的那样，您使用的是 64 位 long（因为这是 non-Windows 系统上的 x86-64），将其指向 32 位 float。因此，除了 strict-aliasing 违规（使用 memcpy 或 std::bit_cast<int32_t>(myfloat) 进行类型双关），这也是性能和正确性的障碍。

您的 perf report 输出证实了这一点； GCC 正在对堆栈执行 32 位 movss %xmm0,0xc(%rsp) 存储，然后重新加载 64 位 mov 0xc(%rsp),%rax，这将导致 a store forwarding stall 花费更多的额外延迟。以及吞吐量损失，因为实际上您是在测试吞吐量，而不是延迟：下一次计算平方反比只有一个常量输入，而不是上一次迭代的结果。（benchmark::DoNotOptimize 包含一个 "memory" 破坏器，它阻止 GCC/clang 将大部分计算提升到循环之外；他们必须假设 number 可能已经改变，因为它不是 const.)

等待加载结果的指令（sar）是那些周期的罪魁祸首， 像往常一样。（当在 cycles 事件计数器环绕时触发中断以收集样本时，CPU 必须找出一条指令导致该事件。通常这最终是等待更早的指令慢指令，或者即使没有数据依赖性也可能只是在慢指令之后，我忘记了。）

Clang 选择假设高 32 位为零，因此 movd %xmm0, %eax 仅使用 ALU uop 复制寄存器，而 shr 而不是 sar 因为它知道它从它假装使用的 64 位 long 的高半部分移入零。（一个函数调用仍然使用 %rdi 所以这不是 Windows clang。）

错误修复版本：GCC 和 clang 生成类似的 asm

修复问题中 quick-bench link 上的代码以使用 int32_t 和 std::bit_cast，https://godbolt.org/z/qbxqsaW4e 显示 GCC 和 clang 编译与 -Ofast，虽然不完全相同。例如GCC 加载 number 两次，一次加载到整数寄存器中，一次加载到 XMM0 中。 Clang 加载一次并使用 movd eax, xmm2 获取它。

在 QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8) 上，现在 GCC 的 BM_FastInverseSqrRoot 比原始版本快 2 倍，没有 -ffast-math

是的，由于 C++ 从 sqrt(float) 推断出 sqrtf，天真的基准测试编译为 sqrtss / divss 而没有 -ffast-math。它每次都会检查数字是否为 >=0，因为 quick-bench 不允许使用 -fno-math-errno 进行编译以忽略该检查以调用 libm 函数。但是那个分支预测得很好，所以循环应该仍然很容易成为端口 0 吞吐量（div/sqrt 单位）的瓶颈。

Quick-bench 允许 -Ofast，相当于 -O3 -ffast-math，它使用 rsqrtss 和牛顿迭代。（使用 FMA 会更快，但是 quick-bench 不允许 -march=native 或任何东西。我想可以使用 __attribute__((target("avx,fma"))).

Quick-bench 现在给出 Error or timeout 无论我是否使用它，以及 权限错误映射页面。 并建议更小的 -m/--mmap_pages所以我无法在该系统上进行测试。

带有牛顿迭代的 rsqrt（就像编译器在 -Ofast 中使用的那样）可能更快或类似于 Quake 的快速 invsqrt，但精度约为 23 位。

为什么 clang 使 Quake 快速反平方根代码比 GCC 快 10 倍？（带有 (long)float 类型的双关语）

Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with (long)float type punning)

c++

assembly

gcc

x86-64

clang

错误修复版本：GCC 和 clang 生成类似的 asm

为什么 clang 使 Quake 快速反平方根代码比 GCC 快 10 倍？ （带有 *(long*)float 类型的双关语）

Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)float type punning)

c++

assembly

gcc

x86-64

clang

错误修复版本：GCC 和 clang 生成类似的 asm

为什么 clang 使 Quake 快速反平方根代码比 GCC 快 10 倍？（带有 (long)float 类型的双关语）

Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with (long)float type punning)