What is an efficient way to load an x64 ymm register with 4 separated doubles?

Question

What is the most efficient way to load an x64 ymm register with

4 evenly spaced doubles, i.e. taken from a contiguous set of doubles

0  1  2  3  4  5  6  7  8  9 10 .. 100
and I want to load, for example, 0, 10, 20, 30

4 doubles at arbitrary positions

i.e. I want to load, for example, 1, 6, 22, 43

Answer 1

I think you should look at a gather instruction such as VGATHERQPD.

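The instruction conditionally loads up to 2 or 4 double-precision floating-point values from memory addresses specified by the memory operand (the second operand) and using qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base and a constant scale factor.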

Note that this requires AVX2, so it does not work on Sandy Bridge/Ivy Bridge, which have AVX but not AVX2.
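For readers working from C rather than assembly, the same gather can be sketched with the `_mm256_i64gather_pd` intrinsic. This is only an illustration: the function names `gather4_avx2` and `gather4` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang-specific. The runtime check reflects the AVX2 requirement mentioned above, falling back to four scalar loads on CPUs without it.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX2 path: load base[idx[0..3]] with one gather (VGATHERQPD). */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const int64_t idx[4], double out[4])
{
    /* Four 64-bit element indices, e.g. {0, 10, 20, 30} or {1, 6, 22, 43}. */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 8 = sizeof(double): address of lane i is base + idx[i] * 8. */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: use the gather only if the CPU reports AVX2
   (GCC/Clang builtin), otherwise do four plain scalar loads. */
void gather4(const double *base, const int64_t idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}
```

Both cases from the question go through the same call: the evenly spaced case passes indices such as `{0, 10, 20, 30}`, the arbitrary case indices such as `{1, 6, 22, 43}`.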

Answer 2

The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later.

VGATHERQPD ymm1, [rsi+ymm7*8], ymm2

Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

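This does it in a single instruction. Here ymm2 is the mask register: the highest bit of each of its elements indicates whether the corresponding value should be loaded into ymm1 (otherwise that element of ymm1 is left unchanged). ymm7 contains the element indices, which are multiplied by the scale factor.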
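The mask behaviour can be illustrated in C with the `_mm256_mask_i64gather_pd` intrinsic. This is a sketch under assumptions: `masked_gather4_avx2` and `masked_gather4` are hypothetical names, a lane is treated as enabled when its mask value is negative (sign bit set, e.g. -1), and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang-specific.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX2 path: lanes whose mask element has its top bit set are loaded from
   base[idx[i]]; the remaining lanes keep their previous value from inout. */
__attribute__((target("avx2")))
static void masked_gather4_avx2(const double *base, const int64_t idx[4],
                                const int64_t lane_mask[4], double inout[4])
{
    __m256d src    = _mm256_loadu_pd(inout);
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Reinterpret the 64-bit integer masks as doubles; only the sign bit matters. */
    __m256d mask   = _mm256_castsi256_pd(
                         _mm256_loadu_si256((const __m256i *)lane_mask));
    __m256d v = _mm256_mask_i64gather_pd(src, base, vindex, mask, 8);
    _mm256_storeu_pd(inout, v);
}

/* Hypothetical wrapper with a scalar fallback for CPUs without AVX2. */
void masked_gather4(const double *base, const int64_t idx[4],
                    const int64_t lane_mask[4], double inout[4])
{
    if (__builtin_cpu_supports("avx2")) {
        masked_gather4_avx2(base, idx, lane_mask, inout);
    } else {
        for (int i = 0; i < 4; ++i)
            if (lane_mask[i] < 0)          /* sign bit set: lane enabled */
                inout[i] = base[idx[i]];
    }
}
```

With a mask of `{-1, 0, -1, 0}`, only lanes 0 and 2 are loaded and lanes 1 and 3 keep whatever `inout` held before the call. An all-ones mask, as produced by the compare-equal trick in assembly, loads all four lanes.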

So, applied to your example, it might look like this in MASM syntax:

4 doubles evenly spaced i.e. a contiguous set of doubles

0 1 2 3 4 5 6 7 8 9 10 .. 100 --- And i want to load for example 0, 10, 20, 30

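.data
  align 16
  qqIndices dq 0,10,20,30
  dpValues  REAL8 0,1,2,3, ... 100
.code
  lea rsi, dpValues
  vmovdqu ymm7, ymmword ptr qqIndices     ; load the four qword indices (unaligned load, no 32-byte alignment needed)
  vpcmpeqw ymm1, ymm1, ymm1               ; set the mask to all ones
  vgatherqpd ymm0, [rsi+ymm7*8], ymm1     ; gather four doubles; the mask in ymm1 is cleared afterwards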

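Now ymm0 contains the four doubles 0, 10, 20, 30. I haven't tested this, though. Another point is that this is not necessarily the fastest option in every case. The values are all gathered individually, that is, each value needs its own memory access; see How are the gather instructions in AVX2 implemented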

So, according to Mysticial's comment

I recently had to do something that required a true gather-load. (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.

the fastest way is to use that approach.
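The sequence described in that comment (2x movsd + 2x movhpd + vinsertf128) can be sketched in C intrinsics as follows. This is an illustration rather than the commenter's actual code: the names `load4_insert_avx` and `load4_insert` are hypothetical, the attribute and builtin are GCC/Clang-specific, and a compiler is free to emit a different instruction sequence than the one noted in the comments. It needs only AVX, not AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX path: build the 256-bit vector from four scalar loads. */
__attribute__((target("avx")))
static void load4_insert_avx(const double *base, const int64_t idx[4], double out[4])
{
    __m128d lo = _mm_load_sd(&base[idx[0]]);      /* movsd:  low half, element 0  */
    lo = _mm_loadh_pd(lo, &base[idx[1]]);         /* movhpd: low half, element 1  */
    __m128d hi = _mm_load_sd(&base[idx[2]]);      /* movsd:  high half, element 0 */
    hi = _mm_loadh_pd(hi, &base[idx[3]]);         /* movhpd: high half, element 1 */
    /* vinsertf128: place hi in the upper 128 bits above lo. */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper with a scalar fallback for CPUs without AVX. */
void load4_insert(const double *base, const int64_t idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx")) {
        load4_insert_avx(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}
```

Which variant is faster depends on the microarchitecture, so it is worth benchmarking both on the target CPU.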

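So "efficient" in terms of opcodes means using VGATHER, while "efficient" in terms of execution time means the latter approach (so far; let's see how future architectures perform).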

EDIT: According to the comments, the VGATHER instructions have become faster on Broadwell and Skylake.

