Combining __restrict__ and __attribute__((aligned(32)))

Question

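I want to make sure gcc knows that:

the pointers point to non-overlapping chunks of memory
the pointers have 32-byte alignment

Is the following correct?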

template<typename T, typename T2>
void f(const  T* __restrict__ __attribute__((aligned(32))) x,
       T2* __restrict__ __attribute__((aligned(32))) out) {}

Thanks.

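Update:

I tried using one read and lots of writes to saturate the CPU ports for writing. I hoped this would make the performance gain from aligned moves more significant.

But the assembly still uses unaligned moves instead of aligned moves.

Code (also on godbolt.org)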

int square(const  float* __restrict__ __attribute__((aligned(32))) x,
           const int size,
           float* __restrict__ __attribute__((aligned(32))) out0,
           float* __restrict__ __attribute__((aligned(32))) out1,
           float* __restrict__ __attribute__((aligned(32))) out2,
           float* __restrict__ __attribute__((aligned(32))) out3,
           float* __restrict__ __attribute__((aligned(32))) out4) {
    for (int i = 0; i < size; ++i) {
        out0[i] = x[i];
        out1[i] = x[i] * x[i];
        out2[i] = x[i] * x[i] * x[i];
        out3[i] = x[i] * x[i] * x[i] * x[i];
        out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
    }
}

The assembly compiled with gcc 8.2 and "-march=haswell -O3" is full of vmovups, which are unaligned moves.

.L3:
        vmovups ymm1, YMMWORD PTR [rbx+rax]
        vmulps  ymm0, ymm1, ymm1
        vmovups YMMWORD PTR [r14+rax], ymm0
        vmulps  ymm0, ymm1, ymm0
        vmovups YMMWORD PTR [r15+rax], ymm0
        vmulps  ymm0, ymm1, ymm0
        vmovups YMMWORD PTR [r12+rax], ymm0
        vmulps  ymm0, ymm1, ymm0
        vmovups YMMWORD PTR [rbp+0+rax], ymm0
        add     rax, 32
        cmp     rax, rdx
        jne     .L3
        and     r13d, -8
        vzeroupper

Same behavior even for sandybridge:

.L3:
        vmovups xmm2, XMMWORD PTR [rbx+rax]
        vinsertf128     ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
        vmulps  ymm0, ymm1, ymm1
        vmovups XMMWORD PTR [r14+rax], xmm0
        vextractf128    XMMWORD PTR [r14+16+rax], ymm0, 0x1
        vmulps  ymm0, ymm1, ymm0
        vmovups XMMWORD PTR [r13+0+rax], xmm0
        vextractf128    XMMWORD PTR [r13+16+rax], ymm0, 0x1
        vmulps  ymm0, ymm1, ymm0
        vmovups XMMWORD PTR [r12+rax], xmm0
        vextractf128    XMMWORD PTR [r12+16+rax], ymm0, 0x1
        vmulps  ymm0, ymm1, ymm0
        vmovups XMMWORD PTR [rbp+0+rax], xmm0
        vextractf128    XMMWORD PTR [rbp+16+rax], ymm0, 0x1
        add     rax, 32
        cmp     rax, rdx
        jne     .L3
        and     r15d, -8
        vzeroupper

Using addition instead of multiplication (godbolt). Still unaligned moves.

Answer 1

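No, using float *__attribute__((aligned(32))) x means that the pointer itself is stored in aligned memory, not that it points to aligned memory.¹

There is a way to do this, but it only helps for gcc, not clang or ICC.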

See "How to tell GCC that a pointer argument is always double-word-aligned?" for __builtin_assume_aligned, which works on all GNU C compatible compilers. Applying __attribute__((aligned(32))) to the pointed-to type through a typedef does work for GCC.

I used __restrict instead of __restrict__ because that name for the C++ extension equivalent of C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.

typedef float aligned32_float __attribute__((aligned(32)));

void prod(const aligned32_float  * __restrict x,
          const aligned32_float  * __restrict y,
          int size,
          aligned32_float* __restrict out0)
{
    size &= -16ULL;

#if 0   // this works for clang, ICC, and GCC
    x = (const float*)__builtin_assume_aligned(x, 32);  // have to cast the result in C++
    y = (const float*)__builtin_assume_aligned(y, 32);
    out0 = (float*)__builtin_assume_aligned(out0, 32);
#endif

    for (int i = 0; i < size; ++i) {
        out0[i] = x[i] * y[i];  // auto-vectorized with a memory operand for mulps
      // note clang using two separate movups loads
      // instead of a memory operand for mulps
    }
}

(gcc, clang, and ICC output on the Godbolt compiler explorer).
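As a rough, self-contained sketch of the __builtin_assume_aligned route: the helper names below (is_aligned32, alloc_floats32, mul_aligned32) are hypothetical and not part of the answer's code. It assumes a GNU C++ compatible compiler (GCC or clang) and a C++17 standard library that provides std::aligned_alloc. The assert only checks the promise in debug builds; if the promise is false at runtime, aligned moves emitted by the compiler can fault.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: true if p sits on a 32-byte boundary.
inline bool is_aligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Hypothetical helper: allocate room for n floats on a 32-byte boundary.
// std::aligned_alloc wants a size that is a multiple of the alignment,
// so the byte count is rounded up to the next multiple of 32.
inline float* alloc_floats32(std::size_t n) {
    std::size_t bytes = (n * sizeof(float) + 31) / 32 * 32;
    return static_cast<float*>(std::aligned_alloc(32, bytes));
}

// Element-wise product. The caller promises that all three pointers are
// 32-byte aligned and do not overlap.
void mul_aligned32(const float* __restrict x,
                   const float* __restrict y,
                   float* __restrict out,
                   int size) {
    assert(is_aligned32(x) && is_aligned32(y) && is_aligned32(out));

    // __builtin_assume_aligned returns void*, so C++ needs a cast.
    const float* xa = static_cast<const float*>(__builtin_assume_aligned(x, 32));
    const float* ya = static_cast<const float*>(__builtin_assume_aligned(y, 32));
    float* oa = static_cast<float*>(__builtin_assume_aligned(out, 32));

    for (int i = 0; i < size; ++i) {
        oa[i] = xa[i] * ya[i];
    }
}
```

Buffers obtained from alloc_floats32 are released with std::free. Whether a given compiler then picks movaps / vmovaps or folds a load into mulps is something to confirm in its assembly output.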

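GCC and clang will use movaps / vmovaps instead of movups / vmovups any time they have a compile-time alignment guarantee. (Unlike MSVC and ICC, which never use movaps for loads/stores, a missed optimization for anything that runs on Core2 / K10 or older.) And as you noticed, GCC is applying the -mavx256-split-unaligned-load / -mavx256-split-unaligned-store effects for tunings other than Haswell, which is another clue that your syntax had no effect.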

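vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address happens to be aligned at runtime. So in practice there is no real problem with your -march=haswell output. Only older CPUs, before Nehalem and Bulldozer, always decoded movups to multiple uops.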

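The real benefit (these days) of telling the compiler about alignment guarantees is avoiding extra code: without the guarantee, compilers sometimes emit startup/cleanup loops to reach an alignment boundary. Also, without AVX, a compiler can't fold a load into a memory operand for mulps unless it is known to be aligned.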

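A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once, or out0[i] *= x[i]. Knowing the alignment enables movaps / mulps xmm0, [rsi]; otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but which will still generate alignment-requiring code when they can fold a load into an ALU operation.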

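It seems __builtin_assume_aligned is the only really portable (across GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that is cumbersome to use, and compilers won't assume alignment unless you actually access memory through objects of that type (as opposed to, e.g., casting a pointer to that struct back to float *).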
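For illustration, here is a minimal sketch of that struct hack. The function name mul_blocks and the block-count parameter are hypothetical. The point is that every access goes through an aligned_floats object, so the 32-byte alignment is part of the type the compiler sees; whether a particular compiler exploits it is something to verify in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>

// Eight floats that always start on a 32-byte boundary.
struct aligned_floats {
    alignas(32) float f[8];
};

static_assert(alignof(aligned_floats) == 32, "struct carries the alignment");
static_assert(sizeof(aligned_floats) == 32, "one block is one 256-bit vector");

// Hypothetical example: multiply n_blocks blocks of 8 floats.
// Memory is only touched via the .f member of aligned_floats objects,
// never via a float* obtained by casting the struct pointer.
void mul_blocks(const aligned_floats* __restrict x,
                const aligned_floats* __restrict y,
                aligned_floats* __restrict out,
                std::size_t n_blocks) {
    for (std::size_t b = 0; b < n_blocks; ++b) {
        for (int j = 0; j < 8; ++j) {
            out[b].f[j] = x[b].f[j] * y[b].f[j];
        }
    }
}
```

The cost shows up at the call site: buffers have to be declared or allocated as arrays of aligned_floats, and lengths are counted in blocks of 8 rather than in elements.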

I try to use one read and lots of write to saturate the cpu ports for writing.

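Using more than 4 output streams could hurt by causing more conflict misses in cache. Skylake's L2 cache is only 4-way, for example. But L1d is 8-way, so you're probably fine for small buffers.

If you want to saturate store-port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends on what you want to test.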

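Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i]+b[i] or STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, which leaves time for store-address uops to use the AGUs on port 2 or 3 during the second cycle of a load. On Haswell and later (256-bit wide load/store ports), the stores need to use a non-indexed addressing mode so that they can run on the simple-addressing-mode store AGU on port 7.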
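To make the access pattern concrete, here is a minimal sketch of a STREAM-triad style kernel (two loads and one store per element). The function name triad and its parameter names are hypothetical. This only shows the source-level pattern; which addressing modes the stores end up with is decided by the compiler and has to be checked in the assembly.

```cpp
#include <cassert>
#include <cstddef>

// STREAM-triad style kernel: a[i] = b[i] + scalar * c[i].
// Per element: two loads (b[i], c[i]) and one store (a[i]).
// __restrict tells the compiler the three arrays do not overlap.
void triad(float* __restrict a,
           const float* __restrict b,
           const float* __restrict c,
           float scalar,
           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + scalar * c[i];
    }
}
```

For a bandwidth measurement the arrays would need to be small enough to stay in L1d and the call repeated many times; that timing harness is left out here.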

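But AMD CPUs can do up to 2 memory ops per clock, with at most one of them being a store, so they would max out with a copy-and-operate pattern (one store per load).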

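BTW, Intel recently announced Sunny Cove (the core microarchitecture used in Ice Lake), which will have 2x load + 2x store throughput per clock, a second vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to avoid bottlenecking on 1-per-clock loop branches.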

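Footnote 1: That's also why (if you compile without AVX) you get the warning below, and why gcc omits the and rsp,-32: it assumes RSP is already aligned. (The function doesn't actually spill any YMM regs, so this should have been optimized away anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects that have extra alignment.)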

<source>:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6

c++

x86

gcc

memory-alignment

restrict-qualifier