SSE/SIMD shift with one-byte element size / granularity?

Question

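As you know, SSE provides the following SIMD shift instructions: PSLL(W/D/Q) and PSRL(W/D/Q).

There is no PSLLB instruction, so how can we shift a vector of 8-bit values (single bytes)?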

Answer 1

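In the special case of a left shift by one, you can use paddb xmm0, xmm0.

As Jester points out in the comments, the best way to emulate the non-existent psrlb and psllb is to use a wider shift and then mask off any bits that crossed an element boundary.

For example: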

    psrlw   xmm0, 2       ; doesn't matter what size (w/d/q): performance is the same for all sizes on all CPUs
    pand    xmm0, [mask_right2]

    section .rodata
    align 16
    ;; required mask depends on the shift count
    mask_right2: times 16  db 0xff >> 2      ; 16 bytes of 0x3f
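The same shift-and-mask idea can be sketched in C with SSE2 intrinsics. This is only an illustration under my own assumptions: the function names (srli2_epi8, slli2_epi8, slli1_epi8, check_fixed_byte_shifts) are hypothetical, and the shift count is fixed at 2 to mirror the asm above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Logical right shift of every byte by 2: psrlw + pand. */
static __m128i srli2_epi8(__m128i v) {
    __m128i wide = _mm_srli_epi16(v, 2);              /* bits from each high byte leak into the low byte */
    return _mm_and_si128(wide, _mm_set1_epi8(0x3f));  /* 0xff >> 2 clears the leaked bits */
}

/* Left shift of every byte by 2: psllw + pand. */
static __m128i slli2_epi8(__m128i v) {
    __m128i wide = _mm_slli_epi16(v, 2);              /* bits from each low byte leak into the high byte */
    return _mm_and_si128(wide, _mm_set1_epi8((char)0xfc));  /* (0xff << 2) & 0xff */
}

/* Left shift by 1 needs no mask: paddb v, v. */
static __m128i slli1_epi8(__m128i v) {
    return _mm_add_epi8(v, v);
}

/* Hypothetical self-check against scalar shifts; returns 1 if every byte value agrees. */
int check_fixed_byte_shifts(void) {
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16], r2[16], l2[16], l1[16];
        for (int i = 0; i < 16; i++) in[i] = (uint8_t)(base + i);
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)r2, srli2_epi8(v));
        _mm_storeu_si128((__m128i *)l2, slli2_epi8(v));
        _mm_storeu_si128((__m128i *)l1, slli1_epi8(v));
        for (int i = 0; i < 16; i++) {
            if (r2[i] != (uint8_t)(in[i] >> 2)) return 0;
            if (l2[i] != (uint8_t)(in[i] << 2)) return 0;
            if (l1[i] != (uint8_t)(in[i] << 1)) return 0;
        }
    }
    return 1;
}
```

A compiler is free to choose how it materializes the _mm_set1_epi8 constants, so the generated code will not necessarily load them from .rodata as the asm does.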

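Instead of using a memory operand for the mask, you can broadcast 0x3f into a vector register ahead of the loop some other way, e.g. vpbroadcastd or vbroadcastss from a dword in memory, SSE3 movddup from a qword, or just a movdqa vector load. (vpbroadcastb needs an extra ALU uop, unlike dword or wider broadcasts, which are plain loads.) Or generate it on the fly with pcmpeqd xmm0,xmm0 / psrlw xmm0, 8+2 / packuswb xmm0,xmm0. With the right choice of shift count, you can generate bytes holding any 2ⁿ-1 pattern (repeated zeros, then repeated ones).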
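As a rough sketch of the pcmpeqd / psrlw / packuswb sequence in C, assuming SSE2 only: make_srl_mask and check_make_srl_mask are names I made up for illustration, and the count is passed in a register (the psrlw xmm, xmm form) so that one function covers every shift count.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Build a vector whose bytes are all (0xff >> n), for n in 0..8,
   without loading a vector constant from memory. */
static __m128i make_srl_mask(int n) {
    __m128i zero  = _mm_setzero_si128();
    __m128i ones  = _mm_cmpeq_epi32(zero, zero);                    /* pcmpeqd: all bits set */
    __m128i words = _mm_srl_epi16(ones, _mm_cvtsi32_si128(8 + n));  /* psrlw: each word becomes 0xff >> n */
    return _mm_packus_epi16(words, words);                          /* packuswb: one copy in every byte */
}

/* Hypothetical self-check; returns 1 if every generated byte equals 0xff >> n. */
int check_make_srl_mask(void) {
    for (int n = 0; n <= 8; n++) {
        uint8_t out[16];
        _mm_storeu_si128((__m128i *)out, make_srl_mask(n));
        for (int i = 0; i < 16; i++)
            if (out[i] != (uint8_t)(0xff >> n)) return 0;
    }
    return 1;
}
```

The unsigned saturation in packuswb never triggers here, because every word is already in the range 0..255 after the shift by 8+n.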

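mov r32, imm32 / movd xmm, r32 plus a shuffle is also an option, but it probably won't save any instruction bytes compared with a pcmpeqw / ... sequence. (Note that the register-source version of VBROADCASTSS requires AVX2, which is irrelevant here because 256-bit integer shifts also require AVX2.)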

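For variable-count vector shifts, one option is to create the mask in an integer register and broadcast it to a vector (use pshufb with an all-zero control register to broadcast the low byte, or use imul eax, eax, 0x01010101 to go from byte to dword, then movd + pshufd). You can also use the pcmpeqd method to create an all-ones vector, shift it with psrlw xmm0, xmm1, and then pack or pshufb.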
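A minimal C sketch of the variable-count case, assuming SSE2 and a count n in 0..7. The names srl_epi8_var and check_srl_epi8_var are hypothetical; the mask is computed in scalar code and broadcast with _mm_set1_epi8, leaving the choice of broadcast sequence to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Logical right shift of every byte by a runtime count n (0..7). */
static __m128i srl_epi8_var(__m128i v, int n) {
    __m128i count = _mm_cvtsi32_si128(n);             /* shift count in the low qword of a vector */
    __m128i mask  = _mm_set1_epi8((char)(0xff >> n)); /* mask built in an integer register, then broadcast */
    return _mm_and_si128(_mm_srl_epi16(v, count), mask);
}

/* Hypothetical self-check against scalar shifts; returns 1 if all counts and byte values agree. */
int check_srl_epi8_var(void) {
    for (int n = 0; n < 8; n++) {
        for (int base = 0; base < 256; base += 16) {
            uint8_t in[16], out[16];
            for (int i = 0; i < 16; i++) in[i] = (uint8_t)(base + i);
            __m128i v = _mm_loadu_si128((const __m128i *)in);
            _mm_storeu_si128((__m128i *)out, srl_epi8_var(v, n));
            for (int i = 0; i < 16; i++)
                if (out[i] != (uint8_t)(in[i] >> n)) return 0;
        }
    }
    return 1;
}
```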

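I don't see any similarly efficient way to emulate an arithmetic right shift (the non-existent PSRAB). PSRAW already handles the high byte of each word correctly. Shifting the low byte of each word into the high position lets a second PSRAW copy its sign bit as many times as needed.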

    ;; vpblendvb is 2 uops on Intel, so this has worse throughput in loops than the pxor/paddb version
    ;; Latency may be the same on Skylake because this has some ILP.

    ; input in xmm0.  Using AVX to save on mov instructions
    VPSLLDQ   xmm1, xmm0, 1      ; or VPSLLW xmm1, xmm0, 8; VPSLLDQ puts this uop on the shuffle port instead of the shift port
    VPSRAW    xmm1, xmm1, 8+2    ; shift low bytes back to final destination

    VPSRAW    xmm0, xmm0, 2      ; shift high bytes, leaving garbage in low bytes
    VPBLENDVB xmm0, xmm1, xmm0, xmm2  ; xmm2 holds a mask of alternating 0 and -1 bytes (0xff00 per word), which could be generated with pcmpeqw / psllw 8.  This insn is fairly slow

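There is no immediate blend with byte granularity, because a single immediate byte can only encode 8 elements.

Without VPBLENDVB (and potentially better even when it is available, if generating or loading a constant for it is slow):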

    ;; Probably worse than the PXOR/PADDB version, if 2 constants are cheap to load
    ;; Needs no vector constants, but this is inefficient vs. versions with constants.
    VPSLLDQ   xmm1, xmm0, 1      ; or VPSLLW 8
    VPSRAW    xmm1, xmm1, n      ; low bytes in the wrong place

    VPSRAW    xmm0, xmm0, 8+n    ; shift high bytes all the way to the bottom of the element
    VPSLLW    xmm0, xmm0, 8      ; high bytes back in place, with zero in the low byte.  (VPSLLDQ can't work: PSRAW 8+n leaves garbage we need to clear)

    VPSRLW    xmm1, xmm1, 8      ; shift low bytes into place, leaving zero in the high byte.  (VPSRLDQ 1 could do this, if we started with VPSLLW instead of VPSLLDQ)
    VPOR      xmm0, xmm0, xmm1

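Doing the byte blend with PAND/PANDN/POR and a constant (alternating 0/-1 bytes) in a register also works, with much less pressure on the shift port, and is the better choice if you have to do this in a loop.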
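Here is a hedged C sketch of the two-PSRAW approach with the PAND/PANDN/POR blend, assuming SSE2 and a count n in 0..7. The names sra_epi8_blend, ref_sra8 and check_sra_epi8_blend are my own; it uses PSLLW 8 rather than PSLLDQ 1 to move the low bytes up, which the asm comments list as an alternative.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Arithmetic right shift of every byte by n (0..7) using two psraw and a byte blend. */
static __m128i sra_epi8_blend(__m128i v, int n) {
    const __m128i high_mask = _mm_set1_epi16((short)0xff00);  /* 0 in low bytes, -1 in high bytes */
    __m128i hi = _mm_sra_epi16(v, _mm_cvtsi32_si128(n));      /* high bytes correct, low bytes garbage */
    __m128i lo = _mm_slli_epi16(v, 8);                        /* move each low byte into the high position */
    lo = _mm_sra_epi16(lo, _mm_cvtsi32_si128(8 + n));         /* shifted result lands back in the low byte */
    return _mm_or_si128(_mm_and_si128(high_mask, hi),         /* pand:  keep high bytes of hi */
                        _mm_andnot_si128(high_mask, lo));     /* pandn: keep low bytes of lo */
}

/* Scalar reference that avoids implementation-defined >> on negative values. */
static uint8_t ref_sra8(uint8_t b, int n) {
    int x = b < 128 ? b : b - 256;            /* interpret the byte as signed */
    int r = x >= 0 ? (x >> n) : ~(~x >> n);
    return (uint8_t)r;
}

/* Hypothetical self-check; returns 1 if all counts and byte values agree with the reference. */
int check_sra_epi8_blend(void) {
    for (int n = 0; n < 8; n++) {
        for (int base = 0; base < 256; base += 16) {
            uint8_t in[16], out[16];
            for (int i = 0; i < 16; i++) in[i] = (uint8_t)(base + i);
            __m128i v = _mm_loadu_si128((const __m128i *)in);
            _mm_storeu_si128((__m128i *)out, sra_epi8_blend(v, n));
            for (int i = 0; i < 16; i++)
                if (out[i] != ref_sra8(in[i], n)) return 0;
        }
    }
    return 1;
}
```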

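Sign-extending a narrow value into the rest of a byte:

This assumes each byte is zero-extended, e.g. after unpacking nibbles to bytes with AND + shift/AND. (It works for any field width; just adjust the constants.)

Flip the high zeros and the sign bit with XOR. Then add 1 at the sign-bit position: that restores the correct sign bit, and either clears the high bits via carry propagation (if the sign bit goes to 0 and carries out) or leaves them set (if it becomes 1 with no carry).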

    ; hoist the constants out of a loop if you're looping, of course.
    ; input in XMM0, upper bits of each byte already zeroed
    pxor   xmm0,  [const_0xf8]     ;   1111 s'xxx
    paddb  xmm0,  [const_0x08]     ;   0000 0xxx   or  1111 1xxx
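To illustrate the "any field width" remark, here is a small C sketch that derives both constants from the width. It assumes SSE2, a width of 1..8, and input bytes whose bits above the field are already zero; sign_extend_fields and check_sign_extend_fields are hypothetical names.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sign-extend a zero-extended `width`-bit field (1..8) in every byte: pxor + paddb. */
static __m128i sign_extend_fields(__m128i v, int width) {
    int sign_bit = 1 << (width - 1);
    __m128i flip = _mm_set1_epi8((char)((0xff << (width - 1)) & 0xff)); /* 0xf8 for width 4 */
    __m128i add  = _mm_set1_epi8((char)sign_bit);                       /* 0x08 for width 4 */
    v = _mm_xor_si128(v, flip);   /* set the high bits and invert the sign bit */
    return _mm_add_epi8(v, add);  /* carry clears the high bits for non-negative fields */
}

/* Hypothetical self-check; returns 1 if every field value of every width is extended correctly. */
int check_sign_extend_fields(void) {
    for (int width = 1; width <= 8; width++) {
        int sign_bit = 1 << (width - 1);
        for (int f = 0; f < (1 << width); f++) {
            uint8_t out[16];
            int expected = f >= sign_bit ? f - (1 << width) : f;
            _mm_storeu_si128((__m128i *)out,
                             sign_extend_fields(_mm_set1_epi8((char)f), width));
            for (int i = 0; i < 16; i++)
                if (out[i] != (uint8_t)expected) return 0;
        }
    }
    return 1;
}
```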

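Using this to emulate the missing `psrab`

This is still possible with only 2 constants from memory. It is very likely the best option for loops, especially if you have spare registers to hoist the loads of those constants. (0xf0 can also be used with vpandn to isolate the low nibble, if you need that too.)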

    psrld  xmm0,  4                              ;   ???? sxxx   (s = sign bit, xxx = lower bits)
    por    xmm0,  xmm5     ; set1_epi8(0xf0)     ;   1111 sxxx

    pxor   xmm0,  xmm6     ; set1_epi8(0x08)     ;   1111 s'xxx
    paddb  xmm0,  xmm6     ; set1_epi8(0x08)     ;   0000 0xxx   or  1111 1xxx
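The four-instruction sequence above maps directly onto SSE2 intrinsics. The sketch below is my own transcription for a shift count of 4, with hypothetical names (srai4_epi8_por, ref_sra4, check_srai4_epi8_por), plus a scalar check.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Arithmetic right shift of every byte by 4: psrld + por + pxor + paddb. */
static __m128i srai4_epi8_por(__m128i v) {
    const __m128i high_nibble = _mm_set1_epi8((char)0xf0);
    const __m128i sign_bit    = _mm_set1_epi8(0x08);
    v = _mm_srli_epi32(v, 4);          /* ???? sxxx : high nibble holds neighbour garbage */
    v = _mm_or_si128(v, high_nibble);  /* 1111 sxxx */
    v = _mm_xor_si128(v, sign_bit);    /* 1111 s'xxx */
    return _mm_add_epi8(v, sign_bit);  /* 0000 0xxx  or  1111 1xxx */
}

/* Scalar reference that avoids implementation-defined >> on negative values. */
static uint8_t ref_sra4(uint8_t b) {
    int x = b < 128 ? b : b - 256;            /* interpret the byte as signed */
    int r = x >= 0 ? (x >> 4) : ~(~x >> 4);
    return (uint8_t)r;
}

/* Hypothetical self-check; returns 1 if every byte value agrees with the reference. */
int check_srai4_epi8_por(void) {
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16], out[16];
        for (int i = 0; i < 16; i++) in[i] = (uint8_t)(base + i);
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, srai4_epi8_por(v));
        for (int i = 0; i < 16; i++)
            if (out[i] != ref_sra4(in[i])) return 0;
    }
    return 1;
}
```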

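I don't think we can avoid using 2 separate boolean operations. We need PXOR to flip the sign bit ahead of the PADDB or PSUBB, but only POR can set bits regardless of their old value.

We could isolate the sign bit and left-shift it before adding or subtracting (pand + pslld + paddb), but that would be worse, especially without AVX's 3-operand instructions to avoid a movdqa. It would also take more instructions, including the POR that we would still need.

Advantages:

Simple instructions that can run on any vector ALU port.
Fewer uops on Intel than the vpblendvb version.

Disadvantages:

No ILP (instruction-level parallelism), so latency may be no better than the vpblendvb version, especially on AMD Zen / Zen2, where vpblendvb is a single-uop instruction with only 1c latency.
Needs 2 vector constants.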

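Sign-extending fields of <= 4 bits using a PSHUFB table lookup

Instead of pxor / paddb, use pshufb to look up a new value for every byte based on its low 4 bits. Unfortunately, pshufb zeroes the lane if the high bit of the selector is set, so we can't use it directly on raw psrld results, which may have non-zero bits shifted into the high positions.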

    const __m128i sext_lut = _mm_setr_epi8( 0,  1,  2,  3,  4,  5,  6,  7,
                                           -8, -7, -6, -5, -4, -3, -2, -1);
    return _mm_shuffle_epi8(sext_lut, v);

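With AVX's 3-operand non-destructive encoding, this can be a single instruction that reuses a lookup table held in a register. Without AVX, it needs a movdqa to copy the LUT.

For a shift: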

    __m128i srai_4_epi8(__m128i v) {
        v = _mm_srli_epi32(v, 4);
        v = _mm_and_si128(v, _mm_set1_epi8(0x0f));
        const __m128i sext_lut = _mm_setr_epi8( 0,  1,  2,  3,  4,  5,  6,  7,
                                               -8, -7, -6, -5, -4, -3, -2, -1);
        return _mm_shuffle_epi8(sext_lut, v);
    }

Answer 2

Here is another way to emulate "psrab". It works with SSE or AVX and needs 1 temporary register:

  __ punpckhbw(scratch, src);  // junk in low bytes
  __ punpcklbw(dst, src);      // junk in low bytes
  __ psraw(scratch, 8 + shift);
  __ psraw(dst, 8 + shift);
  __ packsswb(dst, scratch);   // pack words to get result
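For readers who prefer intrinsics to the macro-assembler style above, the same unpack / shift / pack idea might look roughly like this in C. This is a sketch assuming SSE2 and a count n in 0..7; sra_epi8_unpack, ref_sra8_unpack and check_sra_epi8_unpack are hypothetical names, and the vector is unpacked with itself so the "junk" low bytes are simply copies of the data.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Arithmetic right shift of every byte by n (0..7): punpck[lh]bw + psraw + packsswb. */
static __m128i sra_epi8_unpack(__m128i v, int n) {
    __m128i count = _mm_cvtsi32_si128(8 + n);
    __m128i lo = _mm_unpacklo_epi8(v, v);  /* bytes 0..7 in the high byte of each word, junk below */
    __m128i hi = _mm_unpackhi_epi8(v, v);  /* bytes 8..15 likewise */
    lo = _mm_sra_epi16(lo, count);         /* sign-extended result in each word */
    hi = _mm_sra_epi16(hi, count);
    return _mm_packs_epi16(lo, hi);        /* results fit in int8, so saturation never triggers */
}

/* Scalar reference that avoids implementation-defined >> on negative values. */
static uint8_t ref_sra8_unpack(uint8_t b, int n) {
    int x = b < 128 ? b : b - 256;            /* interpret the byte as signed */
    int r = x >= 0 ? (x >> n) : ~(~x >> n);
    return (uint8_t)r;
}

/* Hypothetical self-check; returns 1 if all counts and byte values agree with the reference. */
int check_sra_epi8_unpack(void) {
    for (int n = 0; n < 8; n++) {
        for (int base = 0; base < 256; base += 16) {
            uint8_t in[16], out[16];
            for (int i = 0; i < 16; i++) in[i] = (uint8_t)(base + i);
            __m128i v = _mm_loadu_si128((const __m128i *)in);
            _mm_storeu_si128((__m128i *)out, sra_epi8_unpack(v, n));
            for (int i = 0; i < 16; i++)
                if (out[i] != ref_sra8_unpack(in[i], n)) return 0;
        }
    }
    return 1;
}
```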
