Why does clang not replace this loop with memmove?

Consider this memcpy-like function:

void copy(unsigned *restrict const dst, unsigned const *restrict const src, unsigned long n)
{
    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = src[x];
    }
}


This code is nicely optimized into a tail call to memcpy:

copy:
        cbz     x2, .L1
        lsl     x2, x2, 2
        b       memcpy
.L1:
        ret

However, when I remove restrict, clang applies loop vectorization and does not replace the loop with memmove. Why is that?

I tried compiling with optimization reports enabled:

clang-10 main.c -c -O3 -fsave-optimization-record -S && cat ./main.opt.yaml

This is what I get with restrict:

--- !Passed
Pass:            loop-idiom
Name:            ProcessLoopStoreOfLoopLoad
DebugLoc:        { File: main.c, Line: 4, Column: 12 }
Function:        copy
Args:
  - String:          'Formed a call to '
  - NewFunction:     llvm.memcpy.p0i8.p0i8.i64
  - String:          '() function'
...

And without restrict:

--- !Passed
Pass:            loop-vectorize
Name:            Vectorized
DebugLoc:        { File: main.c, Line: 3, Column: 3 }
Function:        copy
Args:
  - String:          'vectorized loop (vectorization width: '
  - VectorizationFactor: '4'
  - String:          ', interleaved count: '
  - InterleaveCount: '2'
  - String:          ')'
...

The optimizer goes straight to loop vectorization and skips ProcessLoopStoreOfLoopLoad without printing any message. Why is that? Why can't this code be replaced with memmove?

This is about the observable effect of the operation when the arrays overlap.
For example, take this array:

1 2 3 4

If src points to the 1 and dst points to the 2, then copying n = 3 elements with the loop must produce

1 1 1 1

memmove, on the other hand, behaves like this when the areas overlap:

The memory areas may overlap: copying takes place as though the bytes in src are first copied into a temporary array that does not overlap src or dest, and the bytes are then copied from the temporary array to dest.

That is, the result of such a copy would be:

1 1 2 3

which differs observably from what the original code does.
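
The difference can be reproduced with a small sketch. The helper names copy_forward, run_forward and run_memmove are made up for illustration; the loop body is the one from the question with restrict removed, and the input is the 1 2 3 4 array above with src at the first element, dst at the second, and n = 3.

```c
#include <string.h>

/* The loop from the question without restrict: a plain front-to-back copy. */
void copy_forward(unsigned *dst, const unsigned *src, unsigned long n)
{
    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = src[x];
    }
}

/* Hypothetical helper: run the loop on {1, 2, 3, 4} with dst = src + 1. */
void run_forward(unsigned out[4])
{
    unsigned a[4] = {1, 2, 3, 4};
    copy_forward(a + 1, a, 3);
    memcpy(out, a, sizeof a);
}

/* Hypothetical helper: same input, but copied with memmove. */
void run_memmove(unsigned out[4])
{
    unsigned a[4] = {1, 2, 3, 4};
    memmove(a + 1, a, 3 * sizeof a[0]);
    memcpy(out, a, sizeof a);
}
```

run_forward leaves 1 1 1 1 in out because each store is read back by the next iteration, while run_memmove leaves 1 1 2 3. Since both results are well defined, the compiler may not swap one for the other.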

Also, if you write this memmove-like code:

#include <stdlib.h>

void copy(unsigned *const dst, unsigned const *const src, unsigned long n)
{
    unsigned *tmp = malloc(n * sizeof(*tmp));
    for (unsigned long x = 0; x < n; ++x)
    {
        tmp[x] = src[x];
    }

    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = tmp[x];
    }

    free(tmp);
}

clang happily replaces it with memmove:

copy:                                   # @copy
        testq   %rdx, %rdx
        je      .LBB0_2
        pushq   %rax
        shlq    $2, %rdx
        callq   memmove@PLT
        addq    $8, %rsp
.LBB0_2:
        retq
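
To see which inputs stand in the way of the replacement, here is a hand-written sketch (not what clang emits) that keeps the loop's semantics while using memmove wherever that is safe. The name copy_checked is hypothetical, and the sketch assumes both pointers are suitably aligned for unsigned and that n * sizeof(unsigned) does not overflow. The only problematic case is dst starting strictly inside the source range, because then the loop reads back values it has just written.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical: same observable behavior as the plain forward loop. */
void copy_checked(unsigned *dst, const unsigned *src, unsigned long n)
{
    /* Compare addresses as integers to avoid relational comparison
       of pointers that may point into different objects. */
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (d > s && d - s < n * sizeof *dst)
    {
        /* dst lies inside [src, src + n): keep the element-by-element
           loop, since memmove would give a different result here. */
        for (unsigned long x = 0; x < n; ++x)
        {
            dst[x] = src[x];
        }
    }
    else
    {
        /* No overlap, or dst at or below src: the forward loop and
           memmove produce the same bytes. */
        memmove(dst, src, n * sizeof *dst);
    }
}
```

With restrict on both pointers the compiler is allowed to assume the first branch never happens, which is why the original function collapses to a single library call.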


Why can't this code be replaced with memmove?

Rather the opposite: memmove() gets optimized into a vectorized loop.

So memcpy() is fastest, SIMD comes second, and memmove() third.

(Tested on gcc, which seems to behave exactly like the clang in the OP.)

Combining restrict on both pointers with memmove results in memcpy. It all comes down to possible overlap.
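
The last point can be tried with a small sketch: a function whose two pointer parameters are both restrict-qualified and whose body calls memmove. The name copy_restrict is made up for illustration. Whether the call is actually lowered to memcpy depends on the compiler and its version, so check the generated assembly rather than rely on it.

```c
#include <string.h>

/* Hypothetical: restrict promises the caller never passes overlapping
   buffers, so the compiler is free to treat this memmove as a memcpy. */
void copy_restrict(unsigned *restrict dst, const unsigned *restrict src,
                   unsigned long n)
{
    memmove(dst, src, n * sizeof *dst);
}
```

Calling this with overlapping buffers would break the restrict contract, so it is only a valid replacement for memmove when the caller can guarantee the buffers are disjoint.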