Weird auto-vectorization in gcc with different results on godbolt

Question

I am confused by an auto-vectorization result. The following code, addtest.c,

#include <stdio.h>
#include <stdlib.h>

#define ELEMS 1024

int
main()
{
  float data1[ELEMS], data2[ELEMS];
  for (int i = 0; i < ELEMS; i++) {
    data1[i] = drand48();
    data2[i] = drand48();
  }
  for (int i = 0; i < ELEMS; i++)
    data1[i] += data2[i];
  printf("%g\n", data1[ELEMS-1]); 
  return 0;
}

is compiled with gcc 11.1.0 using

gcc-11 -O3 -march=haswell -masm=intel -save-temps -o addtest addtest.c

The addition loop is auto-vectorized as

.L3:
    vmovaps ymm1, YMMWORD PTR [r12]
    vaddps  ymm0, ymm1, YMMWORD PTR [rax]
    add r12, 32
    add rax, 32
    vmovaps YMMWORD PTR -32[r12], ymm0
    cmp r12, r13
    jne .L3

This is clear: load from data1, load from data2 and add, store to data1, and advance the two pointers in between.
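For readers who do not read assembly, here is a rough C-level picture of what one iteration of that loop does. It is a sketch written with GCC's vector extensions (the vector_size attribute), not something recovered from the compiler's output. The names v8sf and add_arrays_8wide are made up for this illustration, and the function assumes the element count is a multiple of 8.

```c
#include <stddef.h>
#include <string.h>

/* Eight packed floats = 32 bytes, the width of one ymm register.
   v8sf is a hypothetical name used only in this sketch. */
typedef float v8sf __attribute__((vector_size(32)));

/* dst[i] += src[i] for n elements, eight floats per iteration.
   Assumption: n is a multiple of 8 (true for ELEMS = 1024). */
void add_arrays_8wide(float *dst, const float *src, size_t n)
{
    float *end = dst + n;           /* plays the role of r13 */
    while (dst != end) {
        v8sf a, b;
        memcpy(&a, dst, sizeof a);  /* load 8 floats from data1 */
        memcpy(&b, src, sizeof b);  /* load 8 floats from data2 */
        a += b;                     /* element-wise add, like vaddps */
        memcpy(dst, &a, sizeof a);  /* store the sums back to data1 */
        dst += 8;                   /* like add r12, 32 (bytes) */
        src += 8;                   /* like add rax, 32 (bytes) */
    }
}
```

The asm bumps both pointers before the store and compensates with the -32 displacement; the sketch stores first, which computes the same thing.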

If I pass the same code to https://godbolt.org, select x86-64 gcc 11.1 and the options -O3 -march=haswell, I get the following assembly code:

.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3

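One surprising thing is the different address handling, but what completely confuses me is the additional store to [rbp-8240]. As far as I can see, this location is never used again.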
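The "different address handling" corresponds to two ways of writing the same loop. As a rough scalar picture (hypothetical function names, not compiler output): the local build walks two pointers up to an end pointer, while the Godbolt build keeps a single offset in rax and adds it to two fixed rbp-relative bases.

```c
#include <stddef.h>

/* Shape of the local build: two pointers, each bumped every iteration
   and compared against an end pointer (r12, rax and r13 in the asm). */
void add_two_pointers(float *dst, const float *src, size_t n)
{
    for (float *end = dst + n; dst != end; dst++, src++)
        *dst += *src;
}

/* Shape of the Godbolt build: one counter indexes both arrays
   (rax added to two fixed bases in the asm). */
void add_indexed(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i != n; i++)
        dst[i] += src[i];
}
```

A compiler is free to turn either source form into either machine-code form, so choosing one spelling in C does not pin down the addressing mode you get.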

If I select gcc 7.5 on godbolt, the redundant store disappears (but from 8.1 onward it is produced).

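So my questions are:

Why is there a difference between my compiler and Godbolt (different address handling, redundant store)?
What does the redundant store do?

Thanks a lot for your help!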

Answer 1

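The difference-maker is -fpie, which is on by default in most distros but not on Godbolt. That doesn't make much sense, but compilers are complex pieces of machinery, not "smart".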

It's also not specific to -march=haswell or AVX; the same difference happens with just -O3.

Godbolt configures GCC with simpler options than distros do, e.g. without default-pie and without -fstack-protector-strong. To match Godbolt locally, use at least -fno-pie -no-pie -fno-stack-protector. There may be others I'm forgetting.
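One way to check which of these defaults a given gcc applies is to look at the predefined macros GCC documents for them: __PIE__ (1 for -fpie, 2 for -fPIE) and the __SSP__ family for the stack protector. The sketch below wraps them in small functions; the function names are hypothetical, and the stack-protector numbering follows the values GCC documents for its own macros.

```c
#include <stdio.h>

/* 0 = not PIE, 1 = -fpie, 2 = -fPIE (the value GCC gives __PIE__). */
int pie_level(void)
{
#ifdef __PIE__
    return __PIE__;
#else
    return 0;
#endif
}

/* 0 = off, 1 = -fstack-protector, 2 = -fstack-protector-all,
   3 = -fstack-protector-strong, 4 = -fstack-protector-explicit. */
int stack_protector_level(void)
{
#if defined(__SSP_STRONG__)
    return 3;
#elif defined(__SSP_ALL__)
    return 2;
#elif defined(__SSP_EXPLICIT__)
    return 4;
#elif defined(__SSP__)
    return 1;
#else
    return 0;
#endif
}

/* Print both settings for the flags this file was compiled with. */
void print_codegen_defaults(void)
{
    printf("PIE level: %d\n", pie_level());
    printf("stack protector level: %d\n", stack_protector_level());
}
```

Calling print_codegen_defaults() from main in a file built with plain gcc on a default-pie distro should report a non-zero PIE level, and 0 once -fno-pie is added.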

IDK why this triggers or avoids the missed optimization, but I can confirm that it does on my Arch GNU/Linux system with GCC 11.1.

With local gcc -O3 -march=haswell -fno-stack-protector -fno-pie
(and -masm=intel -S -o- vec.c | less), it matches Godbolt:

.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3

But with the distro-configured GCC's defaults, -O3 -march=haswell gives:

.L3:
        vmovaps ymm1, YMMWORD PTR [r12]
        vaddps  ymm0, ymm1, YMMWORD PTR [rax]
        add     r12, 32
        add     rax, 32
        vmovaps YMMWORD PTR -32[r12], ymm0
        cmp     r12, r13
        jne     .L3

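The same missed optimization happens without -march=haswell; we get a movaps XMMWORD PTR [rsp], xmm1 store to a fixed address inside the loop. (Since GCC doesn't need to over-align the stack to spill a 32-byte vector, it isn't using RBP as a frame pointer.)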

For no apparent reason, using -fpie on the Godbolt compiler explorer gets GCC to use two pointer increments instead of indexed addressing modes, also avoiding the redundant store. (That makes the same asm you get locally.) -fpie forces GCC to do that for arrays in static storage (because [arr + rax] would require the symbol address as a 32-bit absolute; see "32-bit absolute addresses no longer allowed in x86-64 Linux?"), but the arrays here are locals on the stack, so that reason does not apply to this loop.
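To look at the static-storage case separately, one could compile a variant like the following with and without -fpie and compare the loops. This is only a sketch for experimentation: the function names are hypothetical, the fill values are arbitrary, and no particular asm output is claimed for it.

```c
#include <stddef.h>

#define ELEMS 1024

/* Arrays in static storage: without PIE, an address like [data1 + rax]
   can embed the symbol address as a 32-bit absolute; with PIE it cannot. */
static float data1[ELEMS], data2[ELEMS];

/* Give the arrays known contents so the result can be checked. */
void fill_static(void)
{
    for (size_t i = 0; i < ELEMS; i++) {
        data1[i] = (float)i;
        data2[i] = 1.0f;
    }
}

/* The same addition loop as in the question, on the static arrays. */
float add_static(void)
{
    for (size_t i = 0; i < ELEMS; i++)
        data1[i] += data2[i];
    return data1[ELEMS - 1];
}
```

Keeping the two functions non-static stops the compiler from folding the whole computation away, so the addition loop stays visible in the asm.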

You can and should report this on GCC's bugzilla with the keyword "missed-optimization".
