为什么这个 SSE2 程序（整数）生成 movaps（浮点数）？

Question

以下循环将一个整数矩阵转置为另一个整数矩阵。当我有趣地编译时，它会生成 movaps 指令以将结果存储到输出矩阵中。为什么 gcc 这样做？

数据：

int __attribute__(( aligned(16))) t[N][M]  
  , __attribute__(( aligned(16))) c_tra[N][M];

循环次数：

for( i=0; i<N; i+=4){
    for(j=0; j<M; j+=4){

        row0 = _mm_load_si128((__m128i *)&t[i][j]);
        row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
        row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
        row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

        __t0 = _mm_unpacklo_epi32(row0, row1);
        __t1 = _mm_unpacklo_epi32(row2, row3);
        __t2 = _mm_unpackhi_epi32(row0, row1);
        __t3 = _mm_unpackhi_epi32(row2, row3);

        /* values back into I[0-3] */
        row0 = _mm_unpacklo_epi64(__t0, __t1);
        row1 = _mm_unpackhi_epi64(__t0, __t1);
        row2 = _mm_unpacklo_epi64(__t2, __t3);
        row3 = _mm_unpackhi_epi64(__t2, __t3);

        _mm_store_si128((__m128i *)&c_tra[j][i], row0);
        _mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
        _mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
        _mm_store_si128((__m128i *)&c_tra[j+3][i], row3);



    }
}

程序集生成的代码：

.L39:
    lea rcx, [rsi+rdx]
    movdqa  xmm1, XMMWORD PTR [rdx]
    add rdx, 16
    add rax, 2048
    movdqa  xmm6, XMMWORD PTR [rcx+rdi]
    movdqa  xmm3, xmm1
    movdqa  xmm2, XMMWORD PTR [rcx+r9]
    punpckldq   xmm3, xmm6
    movdqa  xmm5, XMMWORD PTR [rcx+r10]
    movdqa  xmm4, xmm2
    punpckhdq   xmm1, xmm6
    punpckldq   xmm4, xmm5
    punpckhdq   xmm2, xmm5
    movdqa  xmm5, xmm3
    punpckhqdq  xmm3, xmm4
    punpcklqdq  xmm5, xmm4
    movdqa  xmm4, xmm1
    punpckhqdq  xmm1, xmm2
    punpcklqdq  xmm4, xmm2
    movaps  XMMWORD PTR [rax-2048], xmm5
    movaps  XMMWORD PTR [rax-1536], xmm3
    movaps  XMMWORD PTR [rax-1024], xmm4
    movaps  XMMWORD PTR [rax-512], xmm1
    cmp r11, rdx
    jne .L39

gcc -Wall -msse4.2 -masm="intel" -O2 -c -S skylake linuxmint

-mavx2 或 -march=naticve 生成 VEX 编码：vmovaps.

Answer 1

这些指令在功能上是相同的。我不喜欢复制粘贴其他人的陈述作为我的陈述，所以解释它的链接很少：

Difference between MOVDQA and MOVAPS x86 instructions?

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587

http://masm32.com/board/index.php?topic=1138.0

https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/

简短版本：

So for the most part, you should try to use the move instruction that corresponds with the operations you are going to use on those registers. However, there is an additional complication. Loads and stores to and from memory execute on a separate port from the integer and floating point units; thus instructions that load from memory into a register or store from a register into memory will experience the same delay regardless of the data type you attach to the move. Thus in this case, movaps, movapd, and movdqa will have the same delay no matter what data you use. Since movaps (and movups) is encoded in binary form with one less byte than the other two, it makes sense to use it for all reg-mem moves, regardless of the data type.

原来是GCC优化

为什么这个 SSE2 程序（整数）生成 movaps（浮点数）？

Why this SSE2 program (integers) generate movaps (float)?

x86

assembly

gcc

sse

simd