How can I implement a fast copying function like memcpy()?

Question

I've seen several answers about how memcpy() can achieve a speed-up over a simple byte-by-byte copy. Most of them suggest something along these lines:

#include <stddef.h>
#include <stdint.h>

void *my_memcpy(void *dest, const void *src, size_t n) {
    uint64_t *d = dest;
    const uint64_t *s = src;
    n /= sizeof(uint64_t);

    while (n--)
        *d++ = *s++;

    return dest;
}

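which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior. To keep it simple, assume that n, and the alignment and size of src and dest, are multiples of 8.

If my_memcpy can indeed cause undefined behavior, I want to know how memcpy is implemented so that it copies multiple bytes at a time without violating any compiler assumptions. An example of any valid implementation for x64 would help.

Suggestions to use the library routine do not help. I'm not actually writing my own memcpy. I'm writing a function that could use a similar optimization but that, AFAIK, is not available in the C standard library.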

Answer 1

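memcpy is a special function that the compiler can replace with a built-in version, e.g. if it can prove that the two arrays do not overlap.

Actual, fast implementations almost always use assembler and special intrinsics (e.g. glibc's SSSE3 version), but other libc implementations might implement it in C (e.g. musl).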

Answer 2

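For portability, you should copy in chunks based on alignment, and that chunk type is not necessarily uint64_t. In theory you should use uint_fast8_t, but in practice that type is apparently 1 byte large and 1-byte aligned on most systems. If portability is not required, you can stick with uint64_t.

The next problem is that the pointers passed to memcpy do not necessarily point at aligned addresses, since the standard function is required to work regardless of alignment. So you have to do something like this: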

size_t prealign = (uintptr_t)src % _Alignof(uint64_t);
if(prealign != 0)
{
  // copy bytes up to next aligned address
}

The same goes for the destination, and the same goes for the end of the data.
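
To make the head / body / tail idea concrete, here is a minimal sketch. The function name copy_head_body_tail is hypothetical, and routing the 8-byte moves through a fixed-size memcpy into a local uint64_t is my assumption for keeping the sketch free of aliasing questions; it is not something the fragment above prescribes. Only the source is brought to an aligned address, because source and destination can be misaligned by different amounts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte-wise head, 8-byte body, byte-wise tail. */
void *copy_head_body_tail(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Head: bytes needed to bring src up to the next uint64_t boundary. */
    size_t misalign = (size_t)((uintptr_t)s % _Alignof(uint64_t));
    size_t head = misalign ? _Alignof(uint64_t) - misalign : 0;
    if (head > n)
        head = n;               /* short copies may end inside the head */
    n -= head;
    while (head--)
        *d++ = *s++;

    /* Body: whole 8-byte words. The fixed-size memcpy calls are an
     * assumption of this sketch; they avoid dereferencing a uint64_t
     * pointer that does not match the object's effective type. */
    while (n >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s, sizeof word);
        memcpy(d, &word, sizeof word);
        s += sizeof word;
        d += sizeof word;
        n -= sizeof word;
    }

    /* Tail: the remaining 0..7 bytes. */
    while (n--)
        *d++ = *s++;

    return dest;
}
```

Because the body never dereferences a uint64_t pointer, the destination's alignment does not affect correctness here; it may still affect speed on some targets.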

which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior.

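Correct. So in order to copy uint64_t chunks, you either have to write the code in inline assembler, or you have to disable strict aliasing in a non-standard way when compiling, for example with gcc -fno-strict-aliasing.

The "real" library memcpy is treated as a special case by the compiler, as are many other such library functions. For example, memcpy(&foo, &bar, sizeof(int)); will get translated into a single mov instruction, inlined in the caller's code, without memcpy being called at all.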

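Another note regarding pointer aliasing is that you should restrict-qualify the pointers, as is done for the real memcpy. This tells the compiler it can assume that the dest and src pointers are not the same and do not overlap, which means the compiler does not need to add checks or overhead code for that scenario.

Amusingly, when I write the following naive copy function: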

#include <stdint.h>
#include <stddef.h>

void foocpy (void* dst, const void* src, size_t n)
{
  uint8_t* u8_dst = dst;
  const uint8_t* u8_src = src;

  for(size_t i=0; i<n; i++)
  {
    u8_dst[i] = u8_src[i];
  }
}

the compiler gives me a ton of fairly inefficient machine code. But if I simply add restrict to both pointers, the whole function gets replaced with this:

foocpy:
        test    rdx, rdx
        je      .L1
        jmp     memcpy
.L1:
        ret

This again goes to show that the built-in memcpy is treated as a special snowflake by the compiler.

Answer 3

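The most important points have already been covered in detail.

But I'd add one more: if you write your code in C and your compiler is smarter than you, it will notice that you have written a poor version of memcpy and replace it with a call to the actual built-in memcpy. For example: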

#include <stdlib.h>

void *mymemcpy(void *restrict dest, const void * restrict src, size_t n) {
   const char *csrc = src;
   char *cdest = dest;

   for (size_t i=0; i<n; i++) 
       cdest[i] = csrc[i]; 

   return dest;
}

Compiled with GCC 9.1, the resulting assembly is

mymemcpy:
        test    rdx, rdx
        je      .L7
        sub     rsp, 8
        call    memcpy
        add     rsp, 8
        ret
.L7:
        mov     rax, rdi
        ret

That is, provided you don't try to be too clever...

Answer 4

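Making efficient use of the features of a particular target architecture often requires non-portable code, but the authors of the Standard explicitly recognized that: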

C code can be non-portable. [emphasis original] Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§4).

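Chunking optimizations require the use of a popular extension that almost all implementations can be configured to support. Enabling this extension in gcc and clang with the -fno-strict-aliasing flag may yield poor performance unless the code uses the restrict qualifier where appropriate, but that should be blamed on the failure to use restrict properly. The performance penalty of -fno-strict-aliasing is small in code that uses restrict properly, while failing to use restrict will often impose a significant performance penalty even without -fno-strict-aliasing.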
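
As a narrower, per-type counterpart to the global flag, GCC and Clang also document a may_alias type attribute. The sketch below is an assumption on my part rather than something stated above: the names u64_alias and copy_words are hypothetical, the code is not portable beyond compilers that support the attribute, and it keeps the question's simplifying assumption that both pointers are 8-byte aligned and n is a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (not ISO C): accesses through this type are
 * treated like character-type accesses for alias analysis. */
typedef uint64_t __attribute__((__may_alias__)) u64_alias;

/* Hypothetical helper. Preconditions (not checked): dest and src are
 * 8-byte aligned, do not overlap, and n is a multiple of 8. Any
 * remainder of n modulo 8 is silently ignored. */
void *copy_words(void *restrict dest, const void *restrict src, size_t n)
{
    u64_alias *d = dest;
    const u64_alias *s = src;

    for (n /= sizeof(uint64_t); n != 0; n--)
        *d++ = *s++;

    return dest;
}
```

The restrict qualifiers are kept, in line with the point above that restrict is what gives the compiler room to optimize the loop.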
