clang vs gcc for copying 3 bytes on x86_64 - number of mov's

Question

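In terms of assembly instructions, what should optimized compiled code look like for copying 3 bytes from one place to another, say with memcpy(,,3)?

Consider the following program: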

#include <string.h>
int main() {
  int* p = (int*) 0x10;
  int x = 0;
  memcpy(&x, p, 4);
  x = x * (x > 1 ? 2 : 3);
  memcpy(p, &x, 4);  
  return 0;
}

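It's a bit contrived and will cause a segmentation violation, but I need those instructions so that compiling with -O3 doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get: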

main:
        mov     edx, DWORD PTR ds:16
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

Great - a single mov of a DWORD (= 4 bytes) from memory into a register. Nice and optimized. Now let's change memcpy(&x, p, 4) into memcpy(&x, p, 3). The compilation result becomes:

main:
        mov     DWORD PTR [rsp-4], 0
        movzx   eax, WORD PTR ds:16
        mov     WORD PTR [rsp-4], ax
        movzx   eax, BYTE PTR ds:18
        mov     BYTE PTR [rsp-2], al
        mov     edx, DWORD PTR [rsp-4]
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

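I'm not much of an expert on Intel x86_64 assembly (read: I can't even read it properly when it's complicated), so I don't quite get this. I mean, I get what's happening in the first 6 instructions and why so many of them are necessary. Why aren't two moves sufficient? A mov WORD PTR into al followed by a mov BYTE PTR into ah?

...so, I came here to ask. As I was writing the question, I noticed that GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this: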

main:                                   # @main
        movzx   eax, byte ptr [18]
        shl     eax, 16
        movzx   ecx, word ptr [16]
        or      ecx, eax
        cmp     ecx, 2
        sbb     eax, eax
        and     eax, 1
        or      eax, 2
        imul    eax, ecx
        mov     dword ptr [16], eax
        xor     eax, eax
        ret

This looks more like what I expected. What explains the difference?
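For readers who find the C easier to follow than the listing, the first four instructions of the clang output can be modelled roughly as below. This is only a sketch: the helper name load24_merge is made up for illustration, and it assumes a little-endian target such as x86_64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the question): assemble a 24-bit
   little-endian value entirely in registers, as clang's output does. */
uint32_t load24_merge(const unsigned char *src)
{
    uint16_t lo;
    memcpy(&lo, src, sizeof lo);       /* movzx ecx, word ptr [16]: bytes 0, 1 */
    uint32_t hi = src[2];              /* movzx eax, byte ptr [18]: byte 2     */
    return ((uint32_t)hi << 16) | lo;  /* shl eax, 16 ; or ecx, eax            */
}
```

The two loads never touch the fourth byte, and no temporary in memory is involved.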

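Notes:

The behavior is essentially the same if I don't initialize x = 0.
Other GCC versions do the same thing as GCC 6.3, but GCC 7 gets it down to 5 mov's instead of 6.
Other versions of clang (starting from 3.4) do the same thing.

The behavior is similar if we forgo memcpy: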

#include <string.h>

typedef struct {
  unsigned char data[3];
}  uint24_t;

int main() {
  uint24_t* p = (uint24_t*) 0x30;
  int x = 0;
  *((uint24_t*) &x) = *p;
  x = x * (x > 1 ? 2 : 3);
  *p = *((uint24_t*) &x);
  return 0;
}

If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.

Answer 1

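Size three is an ugly size, and compilers are not perfect.

The compiler cannot generate an access to a memory location that you haven't requested, so two moves are needed.

While this seems trivial to you, remember that you asked for memcpy(&x, p, 4); which is a copy from memory to memory.
Evidently GCC and older versions of Clang are not smart enough to figure out that there is no reason to pass through a temporary in memory.

What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4] out of the three bytes, as you requested: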

mov     DWORD PTR [rsp-4], 0              ;DWORD is 0

movzx   eax, WORD PTR ds:16               ;EAX = byte 0 and byte 1
mov     WORD PTR [rsp-4], ax              ;DWORD has byte 0 and byte 1

movzx   eax, BYTE PTR ds:18               ;EAX = byte 2
mov     BYTE PTR [rsp-2], al              ;DWORD has byte 0, byte 1 and byte 2

mov     edx, DWORD PTR [rsp-4]            ;Same as before from here on

It is using movzx eax, ... to prevent a partial register stall.
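The same six steps can be written back in C, which may make the round trip through memory easier to see. This is a sketch under the assumption of a little-endian target; load24_via_temp is a made-up name, and an optimizing compiler is free to compile it differently from the listing above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the value through a 4-byte temporary in
   memory, mirroring the six annotated instructions. */
uint32_t load24_via_temp(const unsigned char *src)
{
    uint32_t tmp = 0;                     /* mov DWORD PTR [rsp-4], 0      */
    unsigned char *t = (unsigned char *)&tmp;
    memcpy(t, src, 2);                    /* word copy: bytes 0 and 1      */
    memcpy(t + 2, src + 2, 1);            /* byte copy: byte 2             */
    return tmp;                           /* mov edx, DWORD PTR [rsp-4]    */
}
```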

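The compiler has already done a good job by eliding the call to memcpy and, as you said, this example is "a bit contrived" to follow, even for a human. The memcpy optimizations must work for any size, including sizes that don't fit in a register. It's not easy to get it right every time.

Considering that L1 access latencies have dropped considerably in recent architectures and that [rsp-4] is very likely to be in the cache, I'm not sure it is worth messing with the optimization code in the GCC source.
A missed optimization is certainly worth filing a bug for, to see what the developers have to say.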

Answer 2

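You should get better code by copying 4 bytes and masking off the top one, e.g. with x & 0x00ffffff. That lets the compiler know it's allowed to read 4 bytes, not just the 3 that the C source reads.

Yes, that helps a ton: it saves gcc and clang from storing a 4B zero, then copying three bytes and reloading 4. They just load 4, mask, store, and use the value that's still in the register. Part of that might be from not knowing whether *p aliases *q.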

int foo(int *p, int *q) {
  //*p = 0;
  //memcpy(p, q, 3);
  *p = (*q)&0x00ffffff;
  return *p;
}

    mov     eax, DWORD PTR [rsi]     # load
    and     eax, 16777215            # mask
    mov     DWORD PTR [rdi], eax     # store
    ret                              # and leave it in eax as return value
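To check that the two formulations compute the same value, they can be put side by side as below. This is a minimal sketch assuming a little-endian target; the names read24_memcpy and read24_mask are made up for illustration. Note that read24_mask dereferences all four bytes, so the fourth byte must actually be readable, which is exactly the permission the compiler does not have when the source only says memcpy(..., 3).

```c
#include <stdint.h>
#include <string.h>

/* Variant matching the commented-out lines: zero, then copy exactly 3 bytes. */
uint32_t read24_memcpy(const uint32_t *q)
{
    uint32_t x = 0;
    memcpy(&x, q, 3);
    return x;
}

/* Variant from the answer: read all 4 bytes, then mask off the top one. */
uint32_t read24_mask(const uint32_t *q)
{
    return *q & 0x00ffffffu;
}
```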

Why aren't two moves sufficient? A mov WORD PTR into al followed by a mov BYTE PTR into ah?

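AL and AH are 8-bit registers. You can't put a 16-bit word into AL. This is why your last clang-output block loads two separate registers and merges them with a shift+or, in the case where it knows it's allowed to mess with all 4 bytes of x.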

If you were merging two separate one-byte values, you could load them each into AL and AH and then use AX, but that leads to a partial-register stall on Intel pre-Haswell.

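You could do a word-load into AX (or preferably movzx into eax for various reasons, including correctness and avoiding a false dependency on the old value of EAX), left-shift EAX, and then do a byte-load into AL.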
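The answer does not say which bytes each load would cover. One arrangement that puts the bytes in the right order is sketched below in C, assuming a little-endian target: the word load takes bytes 1 and 2, the shift is by 8, and the byte load fills the low 8 bits with byte 0. The helper name and the choice of offsets are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of "word load, shift left, byte load into AL". */
uint32_t load24_shift_then_byte(const unsigned char *src)
{
    uint16_t upper;
    memcpy(&upper, src + 1, sizeof upper);  /* movzx eax, word ptr [src+1] */
    uint32_t v = (uint32_t)upper << 8;      /* shl   eax, 8                */
    v = (v & ~(uint32_t)0xff) | src[0];     /* mov   al, byte ptr [src]:
                                               replaces only the low 8 bits */
    return v;
}
```

Writing AL and then reading the full EAX is the partial-register merge that the surrounding text warns about on older Intel CPUs.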

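But compilers don't tend to do that, because partial-register stuff has been very bad juju for many years, and is only efficient on recent CPUs (Haswell, and maybe IvyBridge). It would cause serious stalls on Nehalem and Core2. (See Agner Fog's microarch pdf; search for partial-register or look for it in the index. See other links in the x86 tag wiki.) Maybe in a few years, -mtune=haswell will enable partial-register tricks to save the OR instruction that clang uses to merge.

Instead of writing such a contrived function:

Write functions that take args and return a value, so you don't have to make them super-weird to keep them from optimizing away. E.g. a function that takes two int* args and does a 3-byte memcpy between them.

This on Godbolt (with gcc and clang), with colour highlighting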

void copy3(int *p, int *q) { memcpy(p, q, 3); }

 clang3.9 -O3 does exactly what you expected: a byte and a word copy.
    mov     al, byte ptr [rsi + 2]
    mov     byte ptr [rdi + 2], al
    movzx   eax, word ptr [rsi]
    mov     word ptr [rdi], ax
    ret

To get the silliness you managed to generate, zero the destination first, and then read it back after a three-byte copy:

int foo(int *p, int *q) {
  *p = 0;
  memcpy(p, q, 3);
  return *p;
}

  clang3.9 -O3
    mov     dword ptr [rdi], 0       # *p = 0
    mov     al, byte ptr [rsi + 2]
    mov     byte ptr [rdi + 2], al   # byte copy
    movzx   eax, word ptr [rsi]
    mov     word ptr [rdi], ax       # word copy
    mov     eax, dword ptr [rdi]     # read the whole thing, causing a store-forwarding stall
    ret

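gcc doesn't do any better (except on CPUs that don't rename partial regs, since it avoids a false dependency on the old value of EAX by using movzx for the byte copy as well).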

Answer 3

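(Not a real answer, since I can't add anything to what others have already answered, so just an example of how I would write such code by hand... probably mostly out of my own curiosity.)

If the function is:

f(24b unsigned n):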

f(0) → 0
f(1) → 3
f(n) → n*2, n > 1

(which is what I take it to be from your question).

Then I would hand-write the assembly (nasm syntax) like this:

    mov     eax,[16]    ; reads 4 bytes from address 16

    ; f(n) starts here, n = low 24b of eax, modifies edx
    xor     edx,edx
    and     eax,0x00FFFFFF
    dec     eax
    setz    dl
    lea     eax,[edx+2*eax+2]
    ; output = low 24b of eax, b24..b31 undefined

    ; writes 3 bytes back to address 16
    mov     [16],ax
    shr     eax,16
    mov     [18],al
