Why clang does not replace this loop with a memmove

Question

Consider this memcpy-like function:

void copy(unsigned *restrict const dst, unsigned const *restrict const src, unsigned long n)
{
    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = src[x];
    }
}

demo

This code is nicely optimized into a call to memcpy:

copy:
        cbz     x2, .L1
        lsl     x2, x2, 2
        b       memcpy
.L1:
        ret

However, when I remove restrict, clang applies loop vectorization and does not replace the loop with a memmove. Why is that?

I tried compiling with optimization reports enabled:

clang-10 main.c -c -O3 -fsave-optimization-record -S && cat ./main.opt.yaml

This is what I get with restrict:

--- !Passed
Pass:            loop-idiom
Name:            ProcessLoopStoreOfLoopLoad
DebugLoc:        { File: main.c, Line: 4, Column: 12 }
Function:        copy
Args:
  - String:          'Formed a call to '
  - NewFunction:     llvm.memcpy.p0i8.p0i8.i64
  - String:          '() function'
...

And without restrict:

--- !Passed
Pass:            loop-vectorize
Name:            Vectorized
DebugLoc:        { File: main.c, Line: 3, Column: 3 }
Function:        copy
Args:
  - String:          'vectorized loop (vectorization width: '
  - VectorizationFactor: '4'
  - String:          ', interleaved count: '
  - InterleaveCount: '2'
  - String:          ')'
...

The optimizer goes straight to loop vectorization and skips ProcessLoopStoreOfLoopLoad without printing any message. Why is that? Why can't this code be replaced with a memmove?

Answer 1

It comes down to the observable effect of the operation when the two arrays overlap.
For example:

1 2 3 4

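If src points to the 1 and dst points to the 2 (with n = 3), the loop copies element by element in ascending order, so each store feeds the next load and the result should be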

1 1 1 1

memmove, on the other hand, does the following when the regions overlap:

The memory areas may overlap: copying takes place as though the bytes in src are first copied into a temporary array that does not overlap src or dest, and the bytes are then copied from the temporary array to dest.

That is, the result of such a copy would be:

1 1 2 3

which is significantly different from what the original code produces.
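To make the difference concrete, here is a minimal sketch that can be run on the 1 2 3 4 example. The names copy_forward and copy_memmove are made up for illustration; copy_forward is the loop from the question without restrict, and copy_memmove is simply a wrapper around the library call.

```c
#include <string.h>

/* The loop from the question, without restrict: elements are copied in
   ascending order, so with dst == src + 1 every store is read back by
   the next iteration. (Hypothetical name, for illustration only.) */
void copy_forward(unsigned *dst, const unsigned *src, unsigned long n)
{
    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = src[x];
    }
}

/* Same signature, but with memmove semantics: behaves as if src were
   first copied to a temporary buffer. (Hypothetical name.) */
void copy_memmove(unsigned *dst, const unsigned *src, unsigned long n)
{
    memmove(dst, src, n * sizeof(*dst));
}
```

Calling copy_forward(a + 1, a, 3) on a = {1, 2, 3, 4} leaves 1 1 1 1 in the array, while copy_memmove(b + 1, b, 3) on the same input leaves 1 1 2 3. Since both results are well defined, a compiler that turned the loop into a memmove would change the program's behavior.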

Also, if you write this memmove-like code:

#include <stdlib.h>

void copy(unsigned *const dst, unsigned const *const src, unsigned long n)
{
    unsigned *tmp = malloc(n * sizeof(*tmp));
    for (unsigned long x = 0; x < n; ++x)
    {
        tmp[x] = src[x];
    }

    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = tmp[x];
    }

    free(tmp);
}

clang will happily replace it with a memmove:

copy:                                   # @copy
        testq   %rdx, %rdx
        je      .LBB0_2
        pushq   %rax
        shlq    $2, %rdx
        callq   memmove@PLT
        addq    $8, %rsp
.LBB0_2:
        retq

demo

Answer 2

Why can't this code be replaced with memmove?

It is the other way around: memmove() gets optimized into a vectorized loop.

So memcpy() is the fastest, SIMD comes second, and memmove() third.

(Tested on gcc, which seems to behave exactly like the clang in the OP.)

Combining 2x restrict with memmove results in a memcpy. It is all a matter of possible overlap.
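A minimal sketch of that last combination, assuming a hypothetical function name copy_restrict_memmove: both pointers are restrict-qualified, so the compiler is told the regions do not overlap and is free to treat the memmove as a memcpy. Whether it actually emits a call to memcpy depends on the compiler, its version, and the optimization level, so check the generated assembly rather than relying on it.

```c
#include <string.h>

/* Hypothetical example: restrict on both pointers promises that the
   regions do not overlap, so the memmove below may legally be lowered
   to memcpy by the optimizer. Calling this with overlapping regions
   would break that promise and is undefined behavior. */
void copy_restrict_memmove(unsigned *restrict dst,
                           const unsigned *restrict src,
                           unsigned long n)
{
    memmove(dst, src, n * sizeof(*dst));
}
```

For non-overlapping buffers the observable result is the same either way; only the library routine that ends up being called differs.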

c

clang