使用未对齐的缓冲区进行矢量化：使用 VMASKMOVPS：根据未对齐计数生成掩码？或者根本不使用那个insn

Question

gcc 5.3 和 -O3 -mavx -mtune=haswell for x86-64 使得 surprisingly bulky code 可以处理 potentially-misaligned 输入代码，例如：

// convenient simple example of compiler input
// I'm not actually interested in this for any real program
void floatmul(float *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

clang 使用未对齐的 load/store 指令，但 gcc 使用标量 intro/outro 和对齐的向量循环：它剥离了第一个 up-to-7 未对齐的迭代，将其完全展开到

的序列

    vmovss  xmm0, DWORD PTR [rdi]
    vaddss  xmm0, xmm0, xmm0      ; multiply by two
    vmovss  DWORD PTR [rdi], xmm0
    cmp     eax, 1
    je      .L13
    vmovss  xmm0, DWORD PTR [rdi+4]
    vaddss  xmm0, xmm0, xmm0
    vmovss  DWORD PTR [rdi+4], xmm0
    cmp     eax, 2
    je      .L14
    ...

这看起来很糟糕，尤其是。对于带有 uop 高速缓存的 CPU。我对此进行了 gcc bug 报告，并建议 gcc 在剥离未对齐的迭代时可以使用 smaller/better 代码。不过，它可能仍然不是最佳选择。

这个问题是关于 AVX 实际上什么才是最佳的。我问的是 gcc 和其他编译器 could/should 使用的 general-case 解决方案。（我没有找到任何 gcc 邮件列表中关于此的讨论，但没有花很长时间寻找。）

可能会有多个答案，因为 -mtune=haswell 的最佳选择可能与 -mtune=bdver3（压路机）的最佳选择不同。然后是允许指令集扩展时什么是最佳的问题（例如 AVX2 用于 256b 整数内容，BMI1 用于在更少的指令中将计数转换为位掩码）。

我知道 Agner Fog 的优化装配指南，第 13.5 节访问未对齐的数据和部分向量。他建议要么使用未对齐的访问，在开始 and/or 结束时进行重叠写入，要么从对齐的访问中改组数据（但是 PALIGNR 只需要 imm8 计数，所以 2x pshufb / por).他认为 VMASKMOVPS 没有用，可能是因为它在 AMD 上的表现非常糟糕。我怀疑如果针对 Intel 进行调整，则值得考虑。如何生成正确的掩码并不明显，因此问题标题。

结果可能会更好，就像 clang 那样简单地使用未对齐的访问。对于短缓冲区，对齐的开销可能会扼杀避免主循环缓存行拆分的任何好处。对于大缓冲区，主内存或 L3 作为瓶颈可能会隐藏高速缓存行拆分的惩罚。如果有人有实验数据来支持他们调整过的任何真实代码，那也是有用的信息。

VMASKMOVPS 看起来确实可用于 Intel 目标。（SSE 版本很糟糕，有一个隐含的 non-temporal 提示，但 AVX 版本没有那个。甚至还有一个新的内在函数来确保你没有得到 128b 操作数的 SSE 版本：_mm128_maskstore_ps) AVX 版本为 only a little bit slow on Haswell:

3 uops / 4c 延迟 / 1-per-2c 吞吐量作为负载。
4 微指令/14c 延迟/1-per-2c 吞吐量作为 256b 存储。
4 微指令/13c 延迟/1-per-1c 吞吐量作为 128b 存储。

Jaguar（每 22c tput 1 个）和 Bulldozer-family：在 Steamroller 上每 16c 1 个（类似于 Bulldozer），或在 Piledriver 上每 ~180c 吞吐量 1 个，存储形式在 AMD CPU 上仍然慢得无法使用.

但是如果我们确实想使用 VMASKMOVPS，我们需要一个向量，每个元素的高位设置实际上应该是 loaded/stored。 PALIGNR 和 PSRLDQ（用于 all-ones 的向量）只需要 compile-time-constant 计数。

请注意，其他位并不重要：它不一定是 all-ones，因此可以将一些设置位分散到元素的高位。

Answer 1

您可以使用 is there an inverse instruction to the movemask instruction in intel avx2? 或：

即时将 (1 << (vector1.size() & 3)) - 1 之类的整数位掩码转换为矢量掩码

将 VMOVMASKPS 的掩码从 window 加载到 table。 AVX2 或 AVX1 带有一些额外的指令或更大的 table.

掩码也可用于寄存器中的ANDPS，减少需要对每个元素精确计数一次。正如 Stephen Canon 在对 OP 的评论中指出的那样，流水线加载可以允许重叠的未对齐存储工作，即使对于像我选择的示例那样的就地重写功能，所以 VMASKMOVPS 是 NOT这里是最好的选择。

这应该适用于 Intel CPU，尤其是。 Haswell 及更高版本的 AVX2。

Agner Fog 获取 pshufb 掩码的方法实际上提供了一个非常有效的想法：从 table 中获取 window 数据进行未对齐加载。代替巨大的 table 掩码，使用索引作为对内存中的数据进行字节移位的方式。

掩码以 LSB 优先字节顺序（因为它们存储在内存中），而不是矢量中 {X3,X2,X1,X0} 元素的常用表示法。如所写，它们与对齐的 window 对齐，包括内存中输入数组的 start/end。

开始未对齐计数 = 0：掩码 = 全一（对齐大小写）
start misalign count = 1: mask = {0,-1,-1,-1,-1,-1,-1,-1} (跳过第一个32B)
...
start misalign count = 7: mask = {0, 0, 0, 0, 0, 0, 0,-1}（跳过第一个 32B 中的一个）
结束未对齐计数 = 0：没有尾随元素。 mask = all-ones（大小写对齐）。
这是一个奇怪的情况，与 count=1 不相似。针对这种特殊情况的一些额外说明值得避免额外的循环迭代和使用全零掩码进行清理。
结束未对齐计数 = 1：一个尾随元素。掩码 = {-1, 0, 0, 0, 0, 0, 0, 0}
...
结束未对齐计数 = 7：七个尾随元素。掩码 = {-1,-1,-1,-1,-1,-1,-1, 0}

未经测试的代码，假定存在错误

section .data
align 32  ; preferably no cache-line boundaries inside the table

; byte elements, to be loaded with pmovsx. all-ones sign-extends
    DB  0,  0,  0,  0,  0,  0,  0,  0
masktable_intro:                      ; index with 0..-7
    DB -1, -1, -1, -1, -1, -1, -1, -1
masktable_outro:                      ; index with -8(aligned), or -1..-7
    DB  0,  0,  0,  0,  0,  0,  0,  0

; the very first and last 0 bytes are not needed, since we avoid an all-zero mask.


section .text
global floatmul   ; (float *rdi)
floatmul:
    mov    eax, edi
    and    eax, 0x1c            ; 0x1c = 7 << 2 = 0b11100

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    and    rdi, ~0x1c           ; Leave the low 2 bits alone, so this still works on misaligned floats.

    shr    eax, 2               ; misalignment-count, in the range [0..7]

    neg        rax
    vpmovsxbd  ymm0, [masktable_intro + rax]  ; Won't link on OS X: Need a separate LEA for RIP-relative

    vmaskmovps ymm1, ymm0, [rdi]
    vaddps     ymm1, ymm1, ymm1   ; *= 2.0
    vmaskmovps [rdi], ymm0, ymm1

    ;;; also prepare the cleanup mask while the table is still hot in L1 cache

    ; if the loop count known to be a multiple of the vector width,
    ; the alignment of the end will be the same as the alignment of the start
    ; so we could just invert the mask
    ; vpxor    xmm1, xmm1, xmm1     ; doesn't need an execution unit
    ; vpcmpeqd ymm0, ymm1, ymm0

    ; In the more general case: just re-generate the mask from the one-past-the-end addr
    mov    eax, edx
    xor    ecx, ecx      ; prep for setcc
    and    eax, 0x1c     ; sets ZF when aligned
    setz   cl            ; rcx=1 in the aligned special-case, else 0
    shr    eax, 2
    lea    eax, [rax + rcx*8]   ; 1..7, or 8 in the aligned case
    neg    rax
    vpmovsxbd  ymm0, [masktable_outro + rax]


.loop:
    add      rdi, 32
    vmovups  ymm1, [rdi]    ; Or vmovaps if you want to fault if the address isn't 4B-aligned
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rdi], ymm1
    cmp      rdi, rdx           ; while( (p+=8) < (start+1024-8) )
    jb    .loop        ; 5 fused-domain uops, yuck.

    ; use the outro mask that we generated before the loop for insn scheduling / cache locality reasons.

    vmaskmov ymm1, ymm0, [rdi]
    vaddps     ymm1, ymm1, ymm1   ; *= 2.0
    vmaskmovps [rdi], ymm0, ymm1
    ret


    ; vpcmpeqd ymm1, ymm1, ymm1   ; worse way to invert the mask: dep-chain breaker but still needs an execution unit to make all-ones instead of all-zeros.
    ; vpxor    ymm0, ymm0, ymm1

这确实需要从 table 加载，这可能会在 L1 缓存中丢失，以及 15B 的 table 数据。（如果循环计数也是可变的，则为 24B，我们必须单独生成结束掩码）。

无论哪种方式，在生成未对齐计数和对齐起始地址的 4 条指令之后，获取掩码只需要一条 vpmosvsxbd 指令。（ymm，mem 形式不能微融合，所以它是 2 微指令）。这需要 AVX2。

没有 AVX2：

vmovdqu 来自 60B table 双字 (DD) 而不是字节 (DB)。这实际上会保存一个相对于 AVX2 的 insn：address & 0x1c 是索引，而不需要右移两位。整个 table 仍然适合缓存行，但没有空间容纳算法可能使用的其他常量。

或者：

2x vpmovsxbd 到两个 128b 寄存器（[masktable_intro + rax] 和 [masktable_intro + rax + 4]）
vinsertf128

或者：（更多的 insns，更多的 shuffle-port 压力，但更少的 load-port 压力，几乎肯定更糟）

vpmovsxbw 到 128b 寄存器
vpunpcklwd / vpunpckhwd 到两个 xmm regs（两个都是 src1=src2）
vinsertf128

开销：

Integer ops: 5 uops 在开始时获取索引并对齐起始指针。 7 微指令获取结束掩码的索引。如果循环元素计数是矢量宽度的倍数，则除了简单地使用未对齐外，总共有 12 个 GP 寄存器微指令。
AVX2：两个 2-fused-domain-uop 向量 insn 从 GP reg 中的 [0..7] 索引到 YMM reg 中的掩码。（一个用于开始掩码，一个用于结束掩码）。使用 24B table，以字节粒度在 8B window 中访问。
AVX：六个 1-fused-domain-uop 向量 insns（三个在开头，三个在结尾）。对于 table 的 RIP 相对寻址，其中四个指令将是 [base+index] 并且不会微熔断，因此额外的两个整数 insn 可能会更好。

循环内的代码被复制了3次。

TODO：写另一个答案来动态生成掩码，可能作为 64b reg 中的字节，然后将其解压缩为 256b。也许有位移位，或者 BMI2 的 BZHI(-1, count)?

Answer 2

仅限 AVX：start/end 处的未对齐访问，流水线加载以避免就地重写时出现问题。

感谢@StephenCanon 指出这比 VMASKMOVPS 更好，因为 VMASKMOVPS 可以帮助循环未对齐的缓冲区。

这可能有点过分期望编译器将其作为循环转换来执行，尤其是。因为显而易见的方法会让 Valgrind 不高兴（见下文）。

section .text
global floatmul   ; (float *rdi)
floatmul:

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    vmovups  ymm0, [rdi]        ; first vector
    vaddps   ymm0, ymm0, ymm0   ; *= 2.0
    ; don't store yet

    lea    rax, [rdi+32]
    and    rax, ~0x1c           ; 0x1c = 7 << 2 = 0b11100  ; clear those bits.
    vmovups  ymm1, [rax]        ; first aligned vector, for use by first loop iteration

    vmovups  [rdi], ymm0        ; store the first unaligned vector
    vmovups  ymm0, [rdx]        ; load the *last* unaligned vector
    
.loop:
      ;; on entry: [rax] is already loaded into ymm1
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rax]              ; vmovaps would fault if p%4 != 0
    add      rax, 32
    vmovups  ymm1, [rax]
    cmp      rax, rdx           ; while( (p+=8) < (endp-8) );
    jb  .loop

    ; discard ymm1.  It includes data from beyond the end of the array (aligned case: same as ymm0)

    vaddps   ymm0, ymm0, ymm0   ; the last 32B, which we loaded before the loop
    vmovups  [rdx], ymm0
    ret

    ;   End alignment:
    ; a[] = XXXX XXXX ABCD E___    _ = garbage past the end
    ;                  ^rdx
    ;       ^rax ^rax ^rax ^rax(loop exit)

    ;   ymm0 = BCDE
    ;   ymm1 loops over ..., XXXX, ABCD, E___
    ;   The last load off the end of the array includes garbage
    ;     because we pipeline the load for the next iteration

在循环开始时从数组末尾进行加载似乎有点奇怪，但希望它不会混淆硬件预取器，或者减慢从内存中获取数组开始的速度。

开销：

2 个额外的整数 uops 总数（用于设置对齐开始）。我们已经在使用普通循环结构的结束指针，所以这是免费的。
2 个额外的循环体副本 (load/calc/store)。（第一次和最后一次迭代去皮）。

自动向量化时，编译器可能不喜欢发出这样的代码。 Valgrind will report accesses outside of array bounds，并通过单步执行和解码指令来查看他们正在访问的内容。因此，仅仅停留在与数组中最后一个元素相同的页面（和缓存行）中是不够的。另请注意，如果输入指针不是 4B 对齐的，我们可能会读入另一个页面并出现段错误。

为了让 Valgrind 满意，我们可以提前停止循环 two 向量宽度，以执行数组未对齐的最后一个向量宽度的特殊情况加载。这将需要额外复制循环体（在本例中无关紧要，但故意这样做是微不足道的。）或者可能通过让介绍代码跳到循环中间来避免流水线操作。（这对于 uop 缓存来说可能不是最优的，但是：（部分）循环体可能会在 uop 缓存中结束两次。）

TODO：编写一个中途跳入循环的版本。

使用未对齐的缓冲区进行矢量化：使用 VMASKMOVPS：根据未对齐计数生成掩码？或者根本不使用那个insn

Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all

x86

assembly

gcc

sse

avx

将 VMOVMASKPS 的掩码从 window 加载到 table。 AVX2 或 AVX1 带有一些额外的指令或更大的 table.

没有 AVX2：

开销：

仅限 AVX：start/end 处的未对齐访问，流水线加载以避免就地重写时出现问题。

开销：