Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?
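Can someone explain to me why there are 3 variants of the fused multiply-accumulate instruction, vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsic, _mm256_fmadd_pd?
To keep things simple, what is the difference between the following (in AT&T syntax)?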
vfmadd132pd %ymm0, %ymm1, %ymm2
vfmadd231pd %ymm0, %ymm1, %ymm2
vfmadd213pd %ymm0, %ymm1, %ymm2
I could not work this out from Intel's intrinsics guide. I ask because I see all three in the assembler output of a piece of C code I wrote. Thanks.
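A clean answer (reformatting the answers below)
For a variant ijk, vfmaddijkpd means:
- Intel syntax: op(i) * op(j) + op(k) -> op(1)
- AT&T syntax: op(4-i) * op(4-j) + op(4-k) -> op(3)
where op(n) is the n-th operand after the instruction. The two syntaxes list the operands in reverse order, so converting between them is the transform:
n <- 4 - n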
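The rule above can be checked mechanically. The following is a minimal sketch, not part of any assembler or library: describe_vfmadd is a hypothetical helper name, and it only formats text from the three digits of the mnemonic and the operand names in the order they are written.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a real API): writes the meaning of
 * vfmadd<digits>pd into buf.
 * ops[] holds the three operand names in the order they are written.
 * att_syntax != 0 means ops[] is in AT&T order (destination last).
 * Returns 0 on success, -1 if digits is not three characters in '1'..'3'. */
int describe_vfmadd(const char *digits, const char *ops[3],
                    int att_syntax, char *buf, size_t buflen)
{
    int idx[3];

    if (strlen(digits) != 3)
        return -1;

    for (int t = 0; t < 3; t++) {
        if (digits[t] < '1' || digits[t] > '3')
            return -1;
        int n = digits[t] - '0';   /* Intel operand number, 1..3 */
        if (att_syntax)
            n = 4 - n;             /* AT&T lists the operands in reverse */
        idx[t] = n - 1;            /* zero-based index into ops[] */
    }

    /* Destination is op(1) in Intel syntax and op(3) in AT&T syntax. */
    int dest = att_syntax ? 2 : 0;

    snprintf(buf, buflen, "%s = %s * %s + %s",
             ops[dest], ops[idx[0]], ops[idx[1]], ops[idx[2]]);
    return 0;
}
```

Called with the digits "132", the operands %ymm0, %ymm1, %ymm2 and att_syntax set to 1, this yields "%ymm2 = %ymm2 * %ymm0 + %ymm1", which is the first instruction from the question.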
This is documented in the instruction set reference, and also in HTML extracts of it, like the entry for VFMADD*PD:
VFMADD132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).
VFMADD213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).
VFMADD231PD: Multiplies the two or four packed double-precision floating-point values from the second source to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).
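A fused multiply-add instruction multiplies two (packed) values, adds a third value, and then overwrites one of the values with the result. Only one of the three values can be a memory operand rather than a register.
All three instructions overwrite ymm0, and only ymm2 may be replaced by a memory operand. The choice of instruction determines which two operands are multiplied and which one is added.
Assuming ymm0 is the first operand in Intel syntax (or the last in AT&T syntax):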
vfmadd132pd: ymm0 = ymm0 * ymm2/mem + ymm1
vfmadd231pd: ymm0 = ymm1 * ymm2/mem + ymm0
vfmadd213pd: ymm0 = ymm1 * ymm0 + ymm2/mem
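The three lines above can be mimicked with a scalar model. This is only a sketch of the operand roles: the function names are made up for illustration, each "register" is a single double, and the plain multiply and add do not reproduce the single rounding step that the real fused instructions perform.

```c
/* Scalar model of the three operand orderings (Intel operand numbering).
 * op1 is the destination and is overwritten; op3 is the operand that the
 * real instructions allow to come from memory.
 * Assumption: plain a * b + c is used, so the result is rounded twice,
 * unlike a true fused multiply-add. */

/* vfmadd132pd: op1 = op1 * op3 + op2 */
void model_vfmadd132(double *op1, double op2, double op3)
{
    *op1 = *op1 * op3 + op2;
}

/* vfmadd231pd: op1 = op2 * op3 + op1 */
void model_vfmadd231(double *op1, double op2, double op3)
{
    *op1 = op2 * op3 + *op1;
}

/* vfmadd213pd: op1 = op2 * op1 + op3 */
void model_vfmadd213(double *op1, double op2, double op3)
{
    *op1 = op2 * *op1 + op3;
}
```

With op1 = 2, op2 = 3 and op3 = 5, the three models leave 13, 17 and 11 in op1 respectively. The useful way to read the variants is by which input gets destroyed: 231 overwrites the addend (the usual accumulator pattern), 132 and 213 overwrite one of the factors, and they differ in whether the memory-capable operand is a factor or the addend.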
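With the C intrinsic this choice is not necessary: the intrinsic does not overwrite any value but returns its result, and it allows all three values to be read from memory. The compiler adds memory reads and writes where needed, and if none of the three values may be overwritten it allocates a temporary register to hold the result. It then picks whichever of the three instructions fits.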
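A short sketch of the intrinsic side, under some assumptions: it relies on the GCC/Clang target attribute and __builtin_cpu_supports builtin so that it builds without extra flags on x86-64, and the names cpu_has_fma and fmadd4 are invented for this example. Which of the three mnemonics appears in the generated assembly is up to the compiler's register allocation and is not fixed by this code.

```c
#include <immintrin.h>

/* Nonzero if the running CPU reports AVX and FMA support.
 * Assumption: a GCC or Clang toolchain on x86, since this is a
 * compiler builtin rather than standard C. */
int cpu_has_fma(void)
{
    return __builtin_cpu_supports("avx") && __builtin_cpu_supports("fma");
}

/* out[i] = a[i] * b[i] + c[i] for four doubles.
 * The target attribute enables AVX and FMA for this function only. */
__attribute__((target("avx,fma")))
void fmadd4(const double *a, const double *b, const double *c, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);

    /* The intrinsic returns a new value; va, vb and vc stay usable. */
    __m256d r = _mm256_fmadd_pd(va, vb, vc);

    _mm256_storeu_pd(out, r);
}
```

Call fmadd4 only after cpu_has_fma() returns nonzero. Because the inputs are not modified at the C level, the compiler is free to decide which register to sacrifice as the destination, and that decision is what selects 132, 213 or 231.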