Why does clang not replace this loop with a memmove?
Consider this memcpy-like function:
void copy(unsigned *restrict const dst, unsigned const *restrict const src, unsigned long n)
{
    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = src[x];
    }
}
This code is optimized nicely into a call to memcpy:
copy:
        cbz     x2, .L1
        lsl     x2, x2, 2
        b       memcpy
.L1:
        ret
However, when I remove restrict, clang applies loop vectorization and does not replace the loop with memmove. Why is that?
I tried compiling with optimization reports enabled:
clang-10 main.c -c -O3 -fsave-optimization-record -S && cat ./main.opt.yaml
This is what I get with restrict:
--- !Passed
Pass: loop-idiom
Name: ProcessLoopStoreOfLoopLoad
DebugLoc: { File: main.c, Line: 4, Column: 12 }
Function: copy
Args:
- String: 'Formed a call to '
- NewFunction: llvm.memcpy.p0i8.p0i8.i64
- String: '() function'
...
And without restrict:
--- !Passed
Pass: loop-vectorize
Name: Vectorized
DebugLoc: { File: main.c, Line: 3, Column: 3 }
Function: copy
Args:
- String: 'vectorized loop (vectorization width: '
- VectorizationFactor: '4'
- String: ', interleaved count: '
- InterleaveCount: '2'
- String: ')'
...
The optimizer goes straight to loop vectorization and skips ProcessLoopStoreOfLoopLoad without printing any message. Why is that? Why can't this code be replaced with memmove?
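It comes down to the observable effect of the operation when the two arrays overlap.
For example, take this array:
1 2 3 4
If src points to the 1 and dst points to the 2, then copying three elements with the loop should produce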
1 1 1 1
memmove, on the other hand, is specified to behave like this when the areas overlap:
The memory areas may overlap: copying takes place as though the bytes in src are first copied into a temporary array that does not overlap src or dest, and the bytes are then copied from the temporary array to dest.
So the result of such a copy would be:
1 1 2 3
which differs significantly from what the original code does.
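The difference is easy to reproduce. The sketch below is only an illustration: the helper names forward_copy, demo_loop and demo_memmove are made up for this example and do not come from the question. It runs the loop and memmove on the same 1 2 3 4 array with dst one element past src.

```c
#include <string.h>

/* Hypothetical helper: the loop from the question, without restrict. */
static void forward_copy(unsigned *dst, const unsigned *src, unsigned long n)
{
    for (unsigned long x = 0; x < n; ++x)
        dst[x] = src[x];
}

/* Copies 3 elements inside {1,2,3,4} with src = &a[0], dst = &a[1],
 * using the element-by-element loop, and stores the final array in out. */
static void demo_loop(unsigned out[4])
{
    unsigned a[4] = {1, 2, 3, 4};
    forward_copy(a + 1, a, 3);
    memcpy(out, a, sizeof a);
}

/* Same setup, but the copy is done by memmove. */
static void demo_memmove(unsigned out[4])
{
    unsigned a[4] = {1, 2, 3, 4};
    memmove(a + 1, a, 3 * sizeof a[0]);
    memcpy(out, a, sizeof a);
}
```

demo_loop leaves 1 1 1 1 in the array, because each store is visible to the next load, while demo_memmove leaves 1 1 2 3. A compiler that silently swapped the loop for memmove would change the program's result.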
Also, if you write this memmove-like code:
#include <stdlib.h>
void copy(unsigned *const dst, unsigned const *const src, unsigned long n)
{
    unsigned *tmp = malloc(n * sizeof(*tmp));
    for (unsigned long x = 0; x < n; ++x)
    {
        tmp[x] = src[x];
    }
    for (unsigned long x = 0; x < n; ++x)
    {
        dst[x] = tmp[x];
    }
    free(tmp);
}
clang will nicely replace it with memmove:
copy:                                   # @copy
        testq   %rdx, %rdx
        je      .LBB0_2
        pushq   %rax
        shlq    $2, %rdx
        callq   memmove@PLT
        addq    $8, %rsp
.LBB0_2:
        retq
Why can't this code be replaced with memmove?
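On the contrary: memmove() gets optimized into a vectorized loop.
So memcpy() is fastest, SIMD comes second, and memmove() third.
(Tested on gcc, which seems to behave exactly like clang in the OP.)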
Two restricts combined with memmove give memcpy. It is all a question of possible overlap.
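To make the overlap point concrete: the element-by-element loop and memmove can only disagree when dst starts inside the source range, after src. The following is a rough sketch under my own assumptions (the helper name loop_matches_memmove and the 8-element buffer are hypothetical, chosen only for illustration); it compares both copies on offsets within a single array. The caller is expected to keep src_off + n and dst_off + n within LEN.

```c
#include <string.h>

enum { LEN = 8 };

/* Hypothetical helper: returns 1 if the plain forward loop and memmove
 * leave the buffer in the same state when copying n elements from
 * offset src_off to offset dst_off of the same array, 0 otherwise. */
static int loop_matches_memmove(unsigned src_off, unsigned dst_off, unsigned n)
{
    unsigned a[LEN], b[LEN];
    for (unsigned i = 0; i < LEN; ++i)
        a[i] = b[i] = i + 1;

    /* The loop from the question, applied inside one array. */
    for (unsigned x = 0; x < n; ++x)
        a[dst_off + x] = a[src_off + x];

    /* The same copy done by memmove on an identical buffer. */
    memmove(b + dst_off, b + src_off, n * sizeof b[0]);

    return memcmp(a, b, sizeof a) == 0;
}
```

With disjoint ranges, such as loop_matches_memmove(0, 4, 4), or with dst before src, such as loop_matches_memmove(2, 0, 4), the two agree. With dst just after src, such as loop_matches_memmove(0, 1, 4), they differ, and that is the case the compiler has to rule out before it may call a library copy routine.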