Does gcc give worse/slower code using v* assembly instructions?

Question

Consider this simple loop:

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 128; i++)
    p += x[i];
  return p;
}

If you compile it with gcc using -O2 -march=haswell, you get:

    f:
            vmovss  xmm0, DWORD PTR .LC0[rip]
            lea     rax, [rdi+512]
    .L2:
            vaddss  xmm0, xmm0, DWORD PTR [rdi]
            add     rdi, 4
            cmp     rdi, rax
            jne     .L2
            ret
    .LC0:
            .long   1065353216

However, the Intel C compiler gives:

f:
        xor       eax, eax                                      #3.3
        pxor      xmm0, xmm0                                    #2.11
        movaps    xmm7, xmm0                                    #2.11
        movaps    xmm6, xmm0                                    #2.11
        movaps    xmm5, xmm0                                    #2.11
        movaps    xmm4, xmm0                                    #2.11
        movaps    xmm3, xmm0                                    #2.11
        movaps    xmm2, xmm0                                    #2.11
        movaps    xmm1, xmm0                                    #2.11
..B1.2:                         # Preds ..B1.2 ..B1.1
        movups    xmm8, XMMWORD PTR [rdi+rax*4]                 #4.10
        movups    xmm9, XMMWORD PTR [16+rdi+rax*4]              #4.10
        movups    xmm10, XMMWORD PTR [32+rdi+rax*4]             #4.10
        movups    xmm11, XMMWORD PTR [48+rdi+rax*4]             #4.10
        movups    xmm12, XMMWORD PTR [64+rdi+rax*4]             #4.10
        movups    xmm13, XMMWORD PTR [80+rdi+rax*4]             #4.10
        movups    xmm14, XMMWORD PTR [96+rdi+rax*4]             #4.10
        movups    xmm15, XMMWORD PTR [112+rdi+rax*4]            #4.10
        addps     xmm0, xmm8                                    #4.5
        addps     xmm7, xmm9                                    #4.5
        addps     xmm6, xmm10                                   #4.5
        addps     xmm5, xmm11                                   #4.5
        addps     xmm4, xmm12                                   #4.5
        addps     xmm3, xmm13                                   #4.5
        addps     xmm2, xmm14                                   #4.5
        addps     xmm1, xmm15                                   #4.5
        add       rax, 32                                       #3.3
        cmp       rax, 128                                      #3.3
        jb        ..B1.2        # Prob 99%                      #3.3
        addps     xmm0, xmm7                                    #2.11
        addps     xmm6, xmm5                                    #2.11
        addps     xmm4, xmm3                                    #2.11
        addps     xmm2, xmm1                                    #2.11
        addps     xmm0, xmm6                                    #2.11
        addps     xmm4, xmm2                                    #2.11
        addps     xmm0, xmm4                                    #2.11
        movaps    xmm1, xmm0                                    #2.11
        movhlps   xmm1, xmm0                                    #2.11
        addps     xmm0, xmm1                                    #2.11
        movaps    xmm2, xmm0                                    #2.11
        shufps    xmm2, xmm0, 245                               #2.11
        addss     xmm0, xmm2                                    #2.11
        addss     xmm0, DWORD PTR .L_2il0floatpacket.0[rip]     #2.11
        ret                                                     #5.10
.L_2il0floatpacket.0:
        .long   0x3f800000
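
For readers who find the icc listing hard to follow, its structure can be sketched in plain C. This is only an approximation read off the assembly above: the name `f_icc_like` and the `acc` array are hypothetical, the eight rows of `acc` stand in for xmm0 to xmm7, and the order in which the accumulators are combined is simplified. Because floating-point addition is not associative, this version can round differently from the original loop for general inputs.

```c
/* Reference: the loop from the question (with a const-qualified parameter). */
float f(const float x[]) {
    float p = 1.0f;
    for (int i = 0; i < 128; i++)
        p += x[i];
    return p;
}

/* Hypothetical sketch of what the icc output computes:
   8 accumulators of 4 lanes each, 32 floats consumed per iteration. */
float f_icc_like(const float x[]) {
    float acc[8][4] = {{0}};               /* pxor + movaps: all lanes start at 0 */

    for (int i = 0; i < 128; i += 32)      /* add rax, 32 / cmp rax, 128 */
        for (int v = 0; v < 8; v++)        /* one movups + addps per accumulator */
            for (int lane = 0; lane < 4; lane++)
                acc[v][lane] += x[i + 4 * v + lane];

    /* Combine the 8 accumulators lane by lane (the real code uses a tree of addps). */
    float lanes[4] = {0};
    for (int v = 0; v < 8; v++)
        for (int lane = 0; lane < 4; lane++)
            lanes[lane] += acc[v][lane];

    /* Horizontal sum: movhlps/addps pairs lanes 0+2 and 1+3, shufps/addss adds the pairs. */
    float s = (lanes[0] + lanes[2]) + (lanes[1] + lanes[3]);

    return s + 1.0f;                       /* the initial 1.0 is added last */
}
```

With inputs that are small integers, every partial sum is exactly representable, so both functions return the same value; with arbitrary inputs the results may differ in the last bits.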

If we ignore the loop unrolling, the most obvious difference is that gcc uses vaddss while icc uses addss.

Is there a performance difference between these two pieces of assembly and which one is better (ignoring the loop unrolling)?

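The v prefix comes from the VEX coding scheme. It seems you can get icc to use these instructions by adding -xavx to the command-line flags. However, the question remains whether there is any performance difference between the two sets of assembly shown above, or whether one has any advantage over the other.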

Answer 1

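Instructions whose mnemonics are prefixed with v are VEX-encoded instructions. The VEX encoding scheme can encode every SSE instruction as well as the new AVX instructions and some others. There is an almost 1:1 correspondence between legacy and VEX-encoded instructions, with the following differences: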

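VEX-encoded SSE instructions implicitly zero the upper 128 bits of the ymm register that corresponds to the destination xmm register. This avoids a costly partial-register update when an earlier instruction has left data in those bits.
The VEX encoding scheme lets an instruction name a separate destination operand instead of overwriting one of its input operands. This reduces register pressure and lets the compiler emit fewer data moves, which slightly improves performance.
AVX instructions can only be encoded with a VEX prefix, because the 256-bit data width cannot be expressed in the legacy encoding.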
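
The point about the separate destination operand can be explored with a small C sketch built on the SSE intrinsics from `<xmmintrin.h>`. The function name `add_keep_both` is hypothetical and not from the question. The idea is that `a` must stay live after the addition: with the legacy encoding, `addss xmm0, xmm1` overwrites xmm0, so a compiler will typically need an extra register copy, whereas with VEX it can emit a three-operand `vaddss` and skip the copy. The exact output depends on the compiler version and flags, so treat this as something to inspect rather than a guaranteed result.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_add_ss, _mm_mul_ss, _mm_cvtss_f32 */

/* Hypothetical helper: computes (a + b) * a, so the original value of `a`
   is still needed after the addition. */
float add_keep_both(float a, float b) {
    __m128 va   = _mm_set_ss(a);          /* a in the low lane, other lanes zero */
    __m128 vb   = _mm_set_ss(b);
    __m128 sum  = _mm_add_ss(va, vb);     /* addss or vaddss, depending on flags */
    __m128 prod = _mm_mul_ss(sum, va);    /* uses va again after the add */
    return _mm_cvtss_f32(prod);           /* extract the low lane */
}
```

Compiling this once with `gcc -O2 -S` and once with `gcc -O2 -mavx -S` and comparing the two listings should show whether the legacy version needs an additional move that the VEX version avoids.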

