Why does p1007r0 std::assume_aligned remove the need for an epilogue?

Question

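My understanding is that vectorization of code works something like this:

For data in the array up to the first address that is a multiple of 128 (or 256, or whatever the SIMD instructions require), do slow element-by-element processing. Let's call this the prologue.

For data between the first address that is a multiple of 128 and the last address that is a multiple of 128, use SIMD instructions.

For data between the last address that is a multiple of 128 and the end of the array, use slow element-by-element processing. Let's call this the epilogue.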
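The three phases described above can be sketched in portable C++ as follows. This is only an illustration of the idea, not what any particular compiler emits: the names `fill_three_phase`, `PhaseCounts`, `kVecBytes` and `kLanes` are made up for this example, a 16-byte vector of 4-byte `int`s is assumed, and the inner lane loop stands in for a single SIMD store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed vector width for this sketch: 16 bytes (SSE-sized), i.e. 4 ints.
constexpr std::size_t kVecBytes = 16;
constexpr std::size_t kLanes = kVecBytes / sizeof(int);

// How many elements each phase handled (hypothetical helper type).
struct PhaseCounts {
    std::size_t prologue;
    std::size_t body;
    std::size_t epilogue;
};

// Fills p[0..n) with value using prologue / vector body / epilogue.
PhaseCounts fill_three_phase(int* p, std::size_t n, int value) {
    PhaseCounts c{0, 0, 0};
    std::size_t i = 0;

    // Prologue: scalar stores until p + i sits on a vector boundary.
    while (i < n && reinterpret_cast<std::uintptr_t>(p + i) % kVecBytes != 0) {
        p[i++] = value;
        ++c.prologue;
    }

    // Main body: whole aligned vectors. The lane loop stands in for
    // one aligned SIMD store per iteration.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            p[i + lane] = value;
        }
        c.body += kLanes;
    }

    // Epilogue: leftover elements that do not fill a whole vector.
    for (; i < n; ++i) {
        p[i] = value;
        ++c.epilogue;
    }
    return c;
}
```

With a pointer that is 4 bytes past a 16-byte boundary and 16 elements, this sketch does 3 scalar prologue stores, 12 elements in the vector body, and 1 scalar epilogue store; with an aligned pointer all 16 elements go through the vector body.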

Now I understand why std::assume_aligned helps with the prologue, but I do not get why it also lets the compiler remove the epilogue.

Quote from the proposal:

If we could make this property visible to the compiler, it could skip the loop prologue and epilogue

Answer 1

This is discussed in section 5 of the document itself:

A function that returns a pointer T* , and guarantees that it will point to over-aligned memory, could return like this:
T* get_overaligned_ptr()
{
    // code...
    return std::assume_aligned<N>(_data);
}
This technique can be used e.g. in the begin() and end() implementations of a class wrapping an over-aligned range of data. As long as such functions are inline, the over-alignment will be transparent to the compiler at the call-site, enabling it to perform the appropriate optimisations without any extra work by the caller.

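The begin() and end() methods are data accessors for the over-aligned buffer _data. That is, begin() returns a pointer to the first element of the buffer, and end() returns a pointer one past the last element of the buffer.

Suppose they are defined as follows: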

T* begin()
{
    // code...
    return std::assume_aligned<N>(_data);
}
T* end()
{
    // code...
    return _data + size; // No alignment hint!
}

In this case, the compiler may not be able to eliminate the epilogue. But if they are defined as follows:

T* begin()
{
    // code...
    return std::assume_aligned<N>(_data);
}
T* end()
{
    // code...
    return std::assume_aligned<N>(_data + size);
}

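Then the compiler would be able to eliminate the epilogue. For example, if the alignment is 128 bits (N = 16, since N is given in bytes), then every single 128-bit chunk of the buffer is guaranteed to be 128-bit aligned. Note that this is only possible when the size of the buffer is a multiple of the alignment.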
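A minimal sketch of such a wrapper, under stated assumptions: `OveralignedBuffer`, `hint_aligned`, `is_aligned_to` and `set42_buffer` are hypothetical names invented for this example, and the fallback branch assumes a GNU-compatible compiler for `__builtin_assume_aligned` when the C++20 library feature is not available. The `static_assert` encodes the condition from the last sentence: the hint on `end()` is only truthful if the buffer size in bytes is a multiple of the alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: forwards to std::assume_aligned when the library
// provides it (C++20), otherwise to the GNU builtin (assumes GCC/Clang).
template <std::size_t N, class T>
T* hint_aligned(T* p) {
#if defined(__cpp_lib_assume_aligned)
    return std::assume_aligned<N>(p);
#else
    return static_cast<T*>(__builtin_assume_aligned(p, N));
#endif
}

// Hypothetical helper for checking the precondition of the hint at run time.
inline bool is_aligned_to(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical wrapper around an over-aligned range of Count elements.
template <class T, std::size_t Count, std::size_t Align>
class OveralignedBuffer {
    // If this fails, end() would not be Align-aligned and the hint
    // in end() would be undefined behaviour.
    static_assert((Count * sizeof(T)) % Align == 0,
                  "buffer size must be a multiple of the alignment");

public:
    T* begin() { return hint_aligned<Align>(data_); }
    T* end() { return hint_aligned<Align>(data_ + Count); }

private:
    alignas(Align) T data_[Count];
};

// Caller code: no alignment annotations needed at the call site,
// as long as begin() and end() are inlined.
inline void set42_buffer(OveralignedBuffer<int, 1024, 64>& buf) {
    for (int* p = buf.begin(); p != buf.end(); ++p) {
        *p = 0x42;
    }
}
```

Whether a given compiler actually drops the prologue and epilogue for `set42_buffer` depends on the compiler version and flags; the sketch only shows how both hints reach the call site.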

Answer 2

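You can see the effect on code-gen from using GNU C/C++ __builtin_assume_aligned.

gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that are not a multiple of a full vector.

Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment is not known. If you knew the alignment, you would not need either a prologue or an epilogue. But if not, you need both: the number of left-over elements after the last aligned vector is not known.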
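The arithmetic behind that point can be made concrete with a small helper. This is an illustrative sketch only: `split_for` and `Split` are names invented here, and it assumes elements are naturally aligned (the start address is a multiple of the element size) and that the compiler uses the align-then-vectorize strategy described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element counts for each phase (hypothetical helper type).
struct Split {
    std::size_t prologue;  // scalar elements before the first aligned vector
    std::size_t vectors;   // number of full aligned vectors
    std::size_t epilogue;  // scalar elements after the last aligned vector
};

// addr: byte address of the first element (assumed multiple of elem)
// n:    element count
// elem: element size in bytes
// vec:  vector width in bytes
constexpr Split split_for(std::uintptr_t addr, std::size_t n,
                          std::size_t elem, std::size_t vec) {
    std::size_t misalign = addr % vec;
    std::size_t pro = (misalign == 0) ? 0 : (vec - misalign) / elem;
    if (pro > n) pro = n;
    std::size_t lanes = vec / elem;
    std::size_t rest = n - pro;
    return Split{pro, rest / lanes, rest % lanes};
}
```

For 1024 `int`s and 16-byte vectors, an aligned start gives 0 prologue elements, 256 vectors and 0 epilogue elements. Starting 4 bytes past a boundary gives 3 prologue elements, 255 vectors and 1 epilogue element, even though 1024 is a multiple of 4. The epilogue length depends on the start address, which is exactly what the compiler does not know without an alignment hint.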

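This Godbolt compiler explorer link shows these functions compiled for x86-64 with ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores. That seems like a weird way to spend so much code size on a loop that just stores.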

// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
    p = (int*)__builtin_assume_aligned(p, 64);
    for (int i=0 ; i<1024 ; i++ ) {
        *p++ = 0x42;
    }
}

 # gcc7.3 -O3   (arch=tune=generic for x86-64 System V: p in RDI)

    lea     rax, [rdi+4096]              # end pointer
    movdqa  xmm0, XMMWORD PTR .LC0[rip]  # set1_epi32(0x42)
.L2:                                     # do {
    add     rdi, 16
    movaps  XMMWORD PTR [rdi-16], xmm0
    cmp     rax, rdi
    jne     .L2                          # }while(p != endp);
    rep ret

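This is pretty much exactly what I would write by hand, except maybe unrolling by 2 so out-of-order execution could discover that the loop-exit branch is not taken while still chewing on the stores.

The unaligned version, by contrast, includes a prologue and an epilogue: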

// without any alignment guarantee
void set42(int *p) {
    for (int i=0 ; i<1024 ; i++ ) {
        *p++ = 0x42;
    }
}

~26 instructions of setup, vs. 2 from the aligned version

.L8:            # then a bloated loop with 4 uops instead of 3
    add     eax, 1
    add     rdx, 16
    movaps  XMMWORD PTR [rdx-16], xmm0
    cmp     ecx, eax
    ja      .L8               # end of main vector loop

 # epilogue:
    mov     eax, esi    # then destroy the counter we spent an extra uop on inside the loop.  /facepalm
    and     eax, -4
    mov     edx, eax
    sub     r8d, eax
    cmp     esi, eax
    lea     rdx, [r9+rdx*4]   # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
    je      .L5
    cmp     r8d, 1
    mov     DWORD PTR [rdx], 66      # fully-unrolled final up-to-3 stores
    je      .L5
    cmp     r8d, 2
    mov     DWORD PTR [rdx+4], 66
    je      .L5
    mov     DWORD PTR [rdx+8], 66
.L5:
    rep ret

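Even for a more complex loop, gcc does not unroll the main vectorized loop, but it spends a lot of code on a fully unrolled scalar prologue/epilogue. This is really bad for AVX2 256-bit vectorization with uint16_t elements or similar (up to 15 elements in the prologue/epilogue, instead of 3). That is not a smart trade-off, so it helps gcc7 and earlier significantly to tell it when pointers are aligned. (The execution speed does not change much, but it makes a big difference for reducing code bloat.)

By the way, gcc8 favours using unaligned loads/stores, on the assumption that data is often aligned anyway. Modern hardware has cheap unaligned 16- and 32-byte loads/stores, so letting the hardware handle the cost of loads/stores that are split across a cache-line boundary is often good. (AVX512 64-byte stores are often worth aligning, because any misalignment means a cache-line split on every access, not on every other or every 4th access.)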
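The "every access vs. every other vs. every 4th" claim is simple arithmetic, which this sketch checks by counting. It assumes a 64-byte cache line (typical for current x86 CPUs, but an assumption here), and `count_line_splits` and `kCacheLine` are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line size in bytes.
constexpr std::size_t kCacheLine = 64;

// Counts how many of `accesses` consecutive vector accesses of `width`
// bytes, starting at byte offset `offset`, straddle a cache-line boundary.
constexpr std::size_t count_line_splits(std::size_t offset, std::size_t width,
                                        std::size_t accesses) {
    std::size_t splits = 0;
    for (std::size_t k = 0; k < accesses; ++k) {
        std::size_t first = offset + k * width;  // first byte touched
        std::size_t last = first + width - 1;    // last byte touched
        if (first / kCacheLine != last / kCacheLine) {
            ++splits;
        }
    }
    return splits;
}
```

With a start offset of 4 bytes and 64 consecutive accesses, 16-byte vectors split a line 16 times (every 4th access), 32-byte vectors 32 times (every other access), and 64-byte vectors 64 times (every access). With offset 0 there are no splits at any width.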

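Another factor is that earlier gcc's fully unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned, potentially overlapping vector at the start/end. If gcc knew how to do that, it would be worth aligning more often.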
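A rough, portable sketch of that overlapping-vector idea, under stated assumptions: `set_overlapping`, `store_vec` and `kVecInts` are hypothetical names, a 16-byte vector of four 4-byte `int`s is assumed, the pointer is assumed to be `int`-aligned, and `memcpy` of four `int`s stands in for a single vector store (a real implementation would use SIMD intrinsics). It only works because the operation is a pure store, so writing the same element twice is harmless.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed vector size for this sketch: 4 ints (16 bytes).
constexpr std::size_t kVecInts = 4;

// Stand-in for one 16-byte vector store; memcpy does not care about alignment.
inline void store_vec(int* dst, int value) {
    const int v[kVecInts] = {value, value, value, value};
    std::memcpy(dst, v, sizeof v);
}

// Fills p[0..n) with value, n >= kVecInts, with no scalar prologue/epilogue.
void set_overlapping(int* p, std::size_t n, int value) {
    assert(n >= kVecInts);  // shorter arrays would need a scalar fallback
    const std::size_t vec_bytes = kVecInts * sizeof(int);

    // One unaligned vector covers everything a prologue would have handled.
    store_vec(p, value);

    // Index of the first vector boundary strictly after p (1..kVecInts).
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t skip = (vec_bytes - addr % vec_bytes) / sizeof(int);

    // Aligned main loop; its first store may overlap the vector above.
    for (std::size_t i = skip; i + kVecInts <= n; i += kVecInts) {
        store_vec(p + i, value);
    }

    // One unaligned vector ending exactly at the end of the array covers
    // the tail; it may overlap the last aligned vector.
    store_vec(p + n - kVecInts, value);
}
```

For loops that read and modify data in place, the overlap needs more care, since processing an element twice would change the result.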
