Broadcasting DWORD to YMM

Question

I'm just wondering whether the code below:

mov eax, r9d    ; eax = j
mul n           ; eax = n * j
shl eax, 2      ; eax = 4 * n * j
                ; now I want to 'broadcast' this to YMM, like so:
                ; ymm = { eax, eax, eax, eax, eax, eax, eax, eax }

  ; This requires AVX512, not just AVX2
  ; vpbroadcastd ymm7, eax

  movd xmm7, eax             ; therefore I must do this workaround?
  vpbroadcastd ymm7, xmm7    ; and finally, the result

can be simplified or optimized in some way?

Answer 1

Yes, if you don't have AVX512, vmovd + vpbroadcastd is the normal way to do this on both Intel and AMD CPUs.

I see 2 optimizations:

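Replace mul n with imul r9d, n, since you aren't using the high half of the product in EDX anyway. 2-operand imul r32, r/m32 is a single uop on all modern CPUs; mul r/m32 takes several. See https://uops.info/ and https://agner.org/optimize/. (The product then ends up in r9d rather than eax, so shift and vmovd from r9d, or keep the mov and use imul eax, n if j must survive in r9d. And of course, if n is an assemble-time constant, use imul eax, r9d, n*4.)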

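Use a VEX prefix on movd xmm7, eax, i.e. vmovd xmm7, eax. If the upper half of any YMM register is dirty when the legacy-SSE movd writes xmm7, it triggers an AVX-SSE transition penalty on Haswell and Ice Lake. (HSW/ICL and SKL use different strategies for handling this transition.)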

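Without AVX512, yes, it takes one uop (such as a movd instruction) to transfer the data from the GP-integer domain to the SIMD domain, and that uop can't also broadcast. You then need a second uop to do the shuffle.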
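As a rough illustration of the two-uop register path, here is a minimal sketch in C intrinsics rather than the asker's assembly. It assumes GCC or Clang (for the target attribute and __builtin_cpu_supports); the function names and the scalar fallback are hypothetical and only there so the snippet runs on machines without AVX2. _mm_cvtsi32_si128 corresponds to vmovd xmm, r32 and _mm256_broadcastd_epi32 to vpbroadcastd ymm, xmm, although the compiler is free to pick other instructions.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define BCAST_HAVE_X86 1
#endif

#ifdef BCAST_HAVE_X86
/* AVX2-only helper (hypothetical name): GP register -> XMM -> all 8 lanes. */
__attribute__((target("avx2")))
static void broadcast_avx2(uint32_t value, uint32_t out[8])
{
    __m128i lo = _mm_cvtsi32_si128((int)value);   /* vmovd xmm, r32        */
    __m256i v  = _mm256_broadcastd_epi32(lo);     /* vpbroadcastd ymm, xmm */
    _mm256_storeu_si256((__m256i *)out, v);       /* unaligned 32-byte store */
}
#endif

/* Computes 4 * n * j (low 32 bits only, like imul + shl) and writes
   the result to all 8 elements of out. */
void broadcast_scaled_index(uint32_t j, uint32_t n, uint32_t out[8])
{
    uint32_t offset = (j * n) << 2;
#ifdef BCAST_HAVE_X86
    if (__builtin_cpu_supports("avx2")) {
        broadcast_avx2(offset, out);
        return;
    }
#endif
    /* Assumption: plain scalar fallback when AVX2 is not available. */
    for (int i = 0; i < 8; i++)
        out[i] = offset;
}
```

For example, broadcast_scaled_index(3, 5, out) fills out with eight copies of 60. In real code the vector would normally stay in a register instead of being stored; the store is only there to make the result easy to inspect.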

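As @chtz points out, if port 5 pressure in the back end of Intel CPUs is the main bottleneck for the loop containing this (rather than total front-end uops or latency), you can do a mov store (e.g. to the stack) and a vpbroadcastd reload.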

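vmovd xmm, r32 and vpbroadcastd ymm, xmm can both run only on port 5 on Intel CPUs. But a store is micro-fused p237 + p4, and a broadcast-load (of 32-bit or wider elements) is handled purely in the load port with no ALU uop, so the total is still 2 front-end uops on Intel CPUs, at a back-end cost of p237+p4 + p23 instead of 2p5. Store-forwarding latency of ~5 or 6 cycles is actually similar to the 1 to 3 cycles for vmovd plus 3 cycles for vpbroadcastd, so this may be worth considering for 32-bit and 64-bit broadcasts from registers, if the load/store ports aren't under much pressure.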

(Possibly including SSE3 movddup broadcast-loads into XMM registers, although an in-lane shuffle has only 1 cycle of latency, so movd + an XMM shuffle is only about 4 cycles of latency on Haswell and later.)
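The store/reload alternative can be sketched the same way in C intrinsics, again assuming GCC or Clang. The names below are hypothetical, and the local variable slot merely stands in for a stack location. _mm256_broadcast_ss is used because it takes a memory operand (vbroadcastss ymm, m32); since it is a pure load, the integer bit pattern passes through unchanged. An optimizing compiler is free to turn this back into the register sequence, so treat it as a statement of intent and check the generated assembly.

```c
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define BCAST_HAVE_X86 1
#endif

#ifdef BCAST_HAVE_X86
/* AVX2-only helper (hypothetical name): store to memory, broadcast-load back. */
__attribute__((target("avx2")))
static void reload_broadcast_avx2(uint32_t value, uint32_t out[8])
{
    float slot;                               /* stand-in for a stack slot */
    memcpy(&slot, &value, sizeof slot);       /* ideally a plain 32-bit mov store */
    __m256 v = _mm256_broadcast_ss(&slot);    /* broadcast-load from memory */
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(v));
}
#endif

/* Writes value to all 8 elements of out via the store/reload path. */
void broadcast_via_reload(uint32_t value, uint32_t out[8])
{
#ifdef BCAST_HAVE_X86
    if (__builtin_cpu_supports("avx2")) {
        reload_broadcast_avx2(value, out);
        return;
    }
#endif
    /* Assumption: plain scalar fallback when AVX2 is not available. */
    for (int i = 0; i < 8; i++)
        out[i] = value;
}
```

Whether this beats the register path depends on what else the surrounding loop does: it trades two port-5 uops for store and load uops and leaves the latency roughly the same.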

Measuring the latency of a movd xmm, r / movd r, xmm round trip is easy, but it's hard to be sure which instruction accounts for which part of it. They could each be just 1 cycle of ALU latency plus 1 cycle of bypass delay on Skylake. Haswell apparently has 1-cycle movd in each direction. https://uops.info/ just measures a not-very-tight upper bound on latency by putting the instruction in a loop with other instructions that create a loop-carried dependency, and assuming the others have 1-cycle latency. https://agner.org/optimize/ makes a guess at how to split up the latency for a pair of instructions. Perhaps one could do better by using store-forwarding for one direction and an ALU transfer for the other, but store-forwarding latency on Sandybridge-family is notoriously variable, and faster if you don't try to reload right away (e.g. useless stores can speed up the critical path through a store-forwarding bottleneck). And you can't assume that store-forwarding from an integer store to a vmovd xmm reload has the same latency as to an integer reload.

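Skylake's movd xmm<->eax round trip has 4 cycles of total latency, up from 2 on Sandybridge / Haswell. That could be 2 and 2 with bypass delays, or 1 and 3; the round-trip number doesn't tell us which direction is slower.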

Zen's is 6 cycles, so perhaps 3 cycles in each direction.

AVX512F vpbroadcastd ymm, r32 is single-uop (port 5), so that's nice if you do have AVX512. (The YMM form also requires AVX512VL.)
