是否有一些有意义的统计数据来证明保持有符号整数算术溢出未定义是合理的?
Is there some meaningful statistical data to justify keeping signed integer arithmetic overflow undefined?
C 标准明确指定有符号整数溢出具有未定义的行为。然而,大多数 CPU 都为溢出实现了具有定义语义的带符号算术(除法溢出可能除外:x / 0
和 INT_MIN / -1
)。
编译器编写者一直在利用此类溢出的 未定义性 来添加更积极的优化,这些优化往往会以非常微妙的方式破坏遗留代码。例如,此代码可能适用于较旧的编译器,但不再适用于当前版本的 gcc
和 clang
:
/* Tncrement a by a value in 0..255, clamp a to positive integers.
The code relies on 32-bit wrap-around, but the C Standard makes
signed integer overflow undefined behavior, so sum_max can now
return values less than a. There are Standard compliant ways to
implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
int res = a + b;
return (res >= a) ? res : INT_MAX;
}
是否有确凿证据表明这些优化是值得的?是否有比较研究记录真实生活示例甚至经典基准的实际改进?
我在看这个的时候想到了这个问题:C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”
我正在标记 c 和 c++ 因为这两种语言的问题相似但答案可能不同。
不完全是优化的例子,但未定义行为的一个有用结果是 -ftrapv
GCC/clang 的命令行开关。它插入的代码会使您的程序在整数溢出时崩溃。
根据无符号溢出是故意的想法,它不适用于无符号整数。
标准对有符号整数溢出的措辞确保人们不会故意编写溢出代码,因此ftrapv
是发现无意溢出的有用工具。
答案实际上在你的问题中:
Yet most CPUs implement signed arithmetics with defined semantics
我想不出你今天可以买到的 CPU 不使用 twos-compliment 算法来计算有符号整数的,但情况并非总是如此。
C语言是1972年发明的,那时候还有IBM 7090大型机。并非所有计算机都是 twos-compliment.
围绕 2s-compliment 定义语言(和溢出行为)将不利于在没有的机器上生成代码。
此外,正如已经说过的,指定有符号溢出是 UB 允许编译器生成更好的代码,因为它可以折扣由有符号溢出导致的代码路径,假设这永远不会发生。
如果我理解正确,它的目的是将 a 和 b 的总和限制为 0....INT_MAX,我可以想出两种方法来以兼容的方式编写此函数。
首先,适用于所有 CPU 的低效一般情况:
int sum_max(int a, unsigned char b) {
if (a > std::numeric_limits<int>::max() - b)
return std::numeric_limits<int>::max();
else
return a + b;
}
二、惊人高效的2s-compliment具体方式:
int sum_max2(int a, unsigned char b) {
unsigned int buffer;
std::memcpy(&buffer, &a, sizeof(a));
buffer += b;
if (buffer > std::numeric_limits<int>::max())
buffer = std::numeric_limits<int>::max();
std::memcpy(&a, &buffer, sizeof(a));
return a;
}
生成的汇编器可以在这里看到:https://godbolt.org/z/F42IXV
我不知道研究和统计数据,但是是的,考虑到编译器实际做的,肯定有优化。是的,它们非常重要(例如 tldr 循环矢量化)。
除了编译器优化之外,还有一个方面需要考虑。使用 UB,您可以获得 C/C++ 有符号整数,其算术行为与您期望的数学行为相同。例如 x + 10 > x
现在成立(当然对于有效代码),但不会出现 wrap-around 行为。
我从 Krister Walfridsson 的博客中找到了一篇很棒的文章 How undefined signed overflow enables optimizations in GCC,其中列出了一些将签名溢出 UB 考虑在内的优化。下面的例子都来自它。我正在向它们添加 C++ 和汇编示例。
如果优化看起来过于简单、无趣或没有影响,请记住,这些优化只是更大的优化链中的步骤。蝴蝶效应确实发生了,因为在较早的步骤中看似不重要的优化可能会在后面的步骤中触发更有影响力的优化。
如果这些示例看起来毫无意义(谁会写 x * 10 > 0
)请记住,您可以使用常量、宏和模板轻松地使用 C 和 C++ 中的此类示例。此外,编译器在其 IR 中应用转换和优化时可以获得此类示例。
有符号整数表达式简化
与0比较消去乘法
(x * c) cmp 0 -> x cmp 0
bool foo(int x) { return x * 10 > 0 }
foo(int):
test edi, edi
setg al
ret
乘完除法
(x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2
int foo(int x) { return (x * 20) / 10; }
foo(int):
lea eax, [rdi+rdi]
ret
消除否定
(-x) / (-y) -> x / y
int foo(int x, int y) { return (-x) / (-y); }
foo(int, int):
mov eax, edi
cdq
idiv esi
ret
简化始终为真或假的比较
x + c < x -> false
x + c <= x -> false
x + c > x -> true
x + c >= x -> true
bool foo(int x) { return x + 10 >= x; }
foo(int):
mov eax, 1
ret
消除比较中的否定
(-x) cmp (-y) -> y cmp x
bool foo(int x, int y) { return -x < -y; }
foo(int, int):
cmp edi, esi
setg al
ret
减小常量的大小
x + c > y -> x + (c - 1) >= y
x + c <= y -> x + (c - 1) < y
bool foo(int x, int y) { return x + 10 <= y; }
foo(int, int):
add edi, 9
cmp edi, esi
setl al
ret
消除比较中的常量
(x + c1) cmp c2 -> x cmp (c2 - c1)
(x + c1) cmp (y + c2) -> x cmp (y + (c2 - c1)) if c1 <= c2
The second transformation is only valid if c1 <= c2, as it would
otherwise introduce an overflow when y has the value INT_MIN.
bool foo(int x) { return x + 42 <= 11; }
foo(int):
cmp edi, -30
setl al
ret
指针运算和类型提升
If an operation does not overflow, then we will get the same result if
we do the operation in a wider type. This is often useful when doing
things like array indexing on 64-bit architectures — the index
calculations are typically done using 32-bit int, but the pointers are
64-bit, and the compiler may generate more efficient code when signed
overflow is undefined by promoting the 32-bit integers to 64-bit
operations instead of generating type extensions.
One other aspect of this is that undefined overflow ensures that a[i]
and a[i+1] are adjacent. This improves analysis of memory accesses for
vectorization etc.
这是一个非常重要的优化,因为循环矢量化是最有效的优化算法之一。
这是将索引从无符号索引更改为有符号索引改进生成的程序集的示例:
未签名版
#include <cstddef>
auto foo(int* v, std::size_t start)
{
int sum = 0;
for (std::size_t i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
对于未签名的情况,必须考虑 start + 4
环绕的情况,并生成一个分支来处理这种情况(分支对性能不利):
; gcc on x64 with -march=skylake
foo1(int*, unsigned long):
cmp rsi, -5
ja .L3
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
.L3:
xor eax, eax
ret
; clang on x64 with -march=skylake
foo1(int*, unsigned long): # @foo1(int*, unsigned long)
xor eax, eax
cmp rsi, -4
jae .LBB0_2
vpbroadcastq xmm0, qword ptr [rdi + 4*rsi + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
.LBB0_2:
ret
作为旁注,使用更窄的类型会导致更糟糕的汇编,从而禁止使用 SSE 向量化指令:
#include <cstddef>
auto foo(int* v, unsigned start)
{
int sum = 0;
for (unsigned i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, unsigned int):
cmp esi, -5
ja .L3
mov eax, esi
mov eax, DWORD PTR [rdi+rax*4]
lea edx, [rsi+1]
add eax, DWORD PTR [rdi+rdx*4]
lea edx, [rsi+2]
add eax, DWORD PTR [rdi+rdx*4]
lea edx, [rsi+3]
add eax, DWORD PTR [rdi+rdx*4]
ret
.L3:
xor eax, eax
ret
; clang on x64 with -march=skylake
foo(int*, unsigned int): # @foo(int*, unsigned int)
xor eax, eax
cmp esi, -5
ja .LBB0_3
mov ecx, esi
add esi, 4
mov eax, dword ptr [rdi + 4*rcx]
lea rdx, [rcx + 1]
cmp rdx, rsi
jae .LBB0_3
add eax, dword ptr [rdi + 4*rcx + 4]
add eax, dword ptr [rdi + 4*rcx + 8]
add eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
ret
签名版
然而,使用带符号的索引会产生漂亮的矢量化无分支代码:
#include <cstddef>
auto foo(int* v, std::ptrdiff_t start)
{
int sum = 0;
for (std::ptrdiff_t i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, long):
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
; clang on x64 with -march=skylake
foo(int*, long): # @foo(int*, long)
vpbroadcastq xmm0, qword ptr [rdi + 4*rsi + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
当使用更窄的有符号类型时,仍然使用矢量化指令:
#include <cstddef>
auto foo(int* v, int start)
{
int sum = 0;
for (int i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, int):
movsx rsi, esi
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
; clang on x64 with -march=skylake
foo(int*, int): # @foo(int*, int)
movsxd rax, esi
vpbroadcastq xmm0, qword ptr [rdi + 4*rax + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
数值范围计算
The compiler keeps track of the variables' range of possible values at
each point in the program, i.e. for code such as
int x = foo();
if (x > 0) {
int y = x + 5;
int z = y / 4;
it determines that x has the range [1, INT_MAX]
after the
if-statement, and can thus determine that y has the range [6, INT_MAX]
as overflow is not allowed. And the next line can be
optimized to int z = y >> 2;
as the compiler knows that y is
non-negative.
auto foo(int x)
{
if (x <= 0)
__builtin_unreachable();
return (x + 5) / 4;
}
foo(int):
lea eax, [rdi+5]
sar eax, 2
ret
The undefined overflow helps optimizations that need to compare two
values (as the wrapping case would give possible values of the form
[INT_MIN, (INT_MIN+4)]
or [6, INT_MAX]
that prevents all useful
comparisons with <
or >
), such as
- Changing comparisons
x<y
to true or false if the ranges for x
and y
does not overlap
- Changing
min(x,y)
or max(x,y)
to x
or y
if the ranges do not overlap
- Changing
abs(x)
to x
or -x
if the range does not cross 0
- Changing
x/c
to x>>log2(c)
if x>0
and the constant c
is a power of 2
- Changing
x%c
to x&(c-1)
if x>0
and the constant c
is a power of 2
循环分析和优化
The canonical example of why undefined signed overflow helps loop
optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate for undefined overflow. This helps
architectures that have specific loop instructions, as they do in
general not handle infinite loops.
But undefined signed overflow helps many more loop optimizations. All
analysis such as determining number of iteration, transforming
induction variables, and keeping track of memory accesses are using
everything in the previous sections in order to do its work. In
particular, the set of loops that can be vectorized are severely
reduced when signed overflow is allowed.
这是一个真正的小基准,冒泡排序。我比较了时间 without/with -fwrapv
(这意味着溢出是 UB/not UB)。以下是结果(秒):
-O3 -O3 -fwrapv -O1 -O1 -fwrapv
Machine1, clang 5.2 6.3 6.8 7.7
Machine2, clang-8 4.2 7.8 6.4 6.7
Machine2, gcc-8 6.6 7.4 6.5 6.5
如您所见,not-UB (-fwrapv
) 版本几乎总是较慢,最大的差异相当大,1.85x。
这是代码。请注意,我有意选择了一个实现,它应该会为此测试产生更大的差异。
#include <stdio.h>
#include <stdlib.h>
void bubbleSort(int *a, long n) {
bool swapped;
for (int i = 0; i < n-1; i++) {
swapped = false;
for (int j = 0; j < n-i-1; j++) {
if (a[j] > a[j+1]) {
int t = a[j];
a[j] = a[j+1];
a[j+1] = t;
swapped = true;
}
}
if (!swapped) break;
}
}
int main() {
int a[8192];
for (int j=0; j<100; j++) {
for (int i=0; i<8192; i++) {
a[i] = rand();
}
bubbleSort(a, 8192);
}
}
C 标准明确指定有符号整数溢出具有未定义的行为。然而,大多数 CPU 都为溢出实现了具有定义语义的带符号算术(除法溢出可能除外:x / 0
和 INT_MIN / -1
)。
编译器编写者一直在利用此类溢出的 未定义性 来添加更积极的优化,这些优化往往会以非常微妙的方式破坏遗留代码。例如,此代码可能适用于较旧的编译器,但不再适用于当前版本的 gcc
和 clang
:
/* Tncrement a by a value in 0..255, clamp a to positive integers.
The code relies on 32-bit wrap-around, but the C Standard makes
signed integer overflow undefined behavior, so sum_max can now
return values less than a. There are Standard compliant ways to
implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
int res = a + b;
return (res >= a) ? res : INT_MAX;
}
是否有确凿证据表明这些优化是值得的?是否有比较研究记录真实生活示例甚至经典基准的实际改进?
我在看这个的时候想到了这个问题:C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”
我正在标记 c 和 c++ 因为这两种语言的问题相似但答案可能不同。
不完全是优化的例子,但未定义行为的一个有用结果是 -ftrapv
GCC/clang 的命令行开关。它插入的代码会使您的程序在整数溢出时崩溃。
根据无符号溢出是故意的想法,它不适用于无符号整数。
标准对有符号整数溢出的措辞确保人们不会故意编写溢出代码,因此ftrapv
是发现无意溢出的有用工具。
答案实际上在你的问题中:
Yet most CPUs implement signed arithmetics with defined semantics
我想不出你今天可以买到的 CPU 不使用 twos-compliment 算法来计算有符号整数的,但情况并非总是如此。
C语言是1972年发明的,那时候还有IBM 7090大型机。并非所有计算机都是 twos-compliment.
围绕 2s-compliment 定义语言(和溢出行为)将不利于在没有的机器上生成代码。
此外,正如已经说过的,指定有符号溢出是 UB 允许编译器生成更好的代码,因为它可以折扣由有符号溢出导致的代码路径,假设这永远不会发生。
如果我理解正确,它的目的是将 a 和 b 的总和限制为 0....INT_MAX,我可以想出两种方法来以兼容的方式编写此函数。
首先,适用于所有 CPU 的低效一般情况:
int sum_max(int a, unsigned char b) {
if (a > std::numeric_limits<int>::max() - b)
return std::numeric_limits<int>::max();
else
return a + b;
}
二、惊人高效的2s-compliment具体方式:
int sum_max2(int a, unsigned char b) {
unsigned int buffer;
std::memcpy(&buffer, &a, sizeof(a));
buffer += b;
if (buffer > std::numeric_limits<int>::max())
buffer = std::numeric_limits<int>::max();
std::memcpy(&a, &buffer, sizeof(a));
return a;
}
生成的汇编器可以在这里看到:https://godbolt.org/z/F42IXV
我不知道研究和统计数据,但是是的,考虑到编译器实际做的,肯定有优化。是的,它们非常重要(例如 tldr 循环矢量化)。
除了编译器优化之外,还有一个方面需要考虑。使用 UB,您可以获得 C/C++ 有符号整数,其算术行为与您期望的数学行为相同。例如 x + 10 > x
现在成立(当然对于有效代码),但不会出现 wrap-around 行为。
我从 Krister Walfridsson 的博客中找到了一篇很棒的文章 How undefined signed overflow enables optimizations in GCC,其中列出了一些将签名溢出 UB 考虑在内的优化。下面的例子都来自它。我正在向它们添加 C++ 和汇编示例。
如果优化看起来过于简单、无趣或没有影响,请记住,这些优化只是更大的优化链中的步骤。蝴蝶效应确实发生了,因为在较早的步骤中看似不重要的优化可能会在后面的步骤中触发更有影响力的优化。
如果这些示例看起来毫无意义(谁会写 x * 10 > 0
)请记住,您可以使用常量、宏和模板轻松地使用 C 和 C++ 中的此类示例。此外,编译器在其 IR 中应用转换和优化时可以获得此类示例。
有符号整数表达式简化
与0比较消去乘法
(x * c) cmp 0 -> x cmp 0
bool foo(int x) { return x * 10 > 0 }
foo(int): test edi, edi setg al ret
乘完除法
(x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2
int foo(int x) { return (x * 20) / 10; }
foo(int): lea eax, [rdi+rdi] ret
消除否定
(-x) / (-y) -> x / y
int foo(int x, int y) { return (-x) / (-y); }
foo(int, int): mov eax, edi cdq idiv esi ret
简化始终为真或假的比较
x + c < x -> false x + c <= x -> false x + c > x -> true x + c >= x -> true
bool foo(int x) { return x + 10 >= x; }
foo(int): mov eax, 1 ret
消除比较中的否定
(-x) cmp (-y) -> y cmp x
bool foo(int x, int y) { return -x < -y; }
foo(int, int): cmp edi, esi setg al ret
减小常量的大小
x + c > y -> x + (c - 1) >= y x + c <= y -> x + (c - 1) < y
bool foo(int x, int y) { return x + 10 <= y; }
foo(int, int): add edi, 9 cmp edi, esi setl al ret
消除比较中的常量
(x + c1) cmp c2 -> x cmp (c2 - c1) (x + c1) cmp (y + c2) -> x cmp (y + (c2 - c1)) if c1 <= c2
The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.
bool foo(int x) { return x + 42 <= 11; }
foo(int): cmp edi, -30 setl al ret
指针运算和类型提升
If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.
One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.
这是一个非常重要的优化,因为循环矢量化是最有效的优化算法之一。
这是将索引从无符号索引更改为有符号索引改进生成的程序集的示例:
未签名版
#include <cstddef>
auto foo(int* v, std::size_t start)
{
int sum = 0;
for (std::size_t i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
对于未签名的情况,必须考虑 start + 4
环绕的情况,并生成一个分支来处理这种情况(分支对性能不利):
; gcc on x64 with -march=skylake
foo1(int*, unsigned long):
cmp rsi, -5
ja .L3
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
.L3:
xor eax, eax
ret
; clang on x64 with -march=skylake
foo1(int*, unsigned long): # @foo1(int*, unsigned long)
xor eax, eax
cmp rsi, -4
jae .LBB0_2
vpbroadcastq xmm0, qword ptr [rdi + 4*rsi + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
.LBB0_2:
ret
作为旁注,使用更窄的类型会导致更糟糕的汇编,从而禁止使用 SSE 向量化指令:
#include <cstddef>
auto foo(int* v, unsigned start)
{
int sum = 0;
for (unsigned i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, unsigned int):
cmp esi, -5
ja .L3
mov eax, esi
mov eax, DWORD PTR [rdi+rax*4]
lea edx, [rsi+1]
add eax, DWORD PTR [rdi+rdx*4]
lea edx, [rsi+2]
add eax, DWORD PTR [rdi+rdx*4]
lea edx, [rsi+3]
add eax, DWORD PTR [rdi+rdx*4]
ret
.L3:
xor eax, eax
ret
; clang on x64 with -march=skylake
foo(int*, unsigned int): # @foo(int*, unsigned int)
xor eax, eax
cmp esi, -5
ja .LBB0_3
mov ecx, esi
add esi, 4
mov eax, dword ptr [rdi + 4*rcx]
lea rdx, [rcx + 1]
cmp rdx, rsi
jae .LBB0_3
add eax, dword ptr [rdi + 4*rcx + 4]
add eax, dword ptr [rdi + 4*rcx + 8]
add eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
ret
签名版
然而,使用带符号的索引会产生漂亮的矢量化无分支代码:
#include <cstddef>
auto foo(int* v, std::ptrdiff_t start)
{
int sum = 0;
for (std::ptrdiff_t i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, long):
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
; clang on x64 with -march=skylake
foo(int*, long): # @foo(int*, long)
vpbroadcastq xmm0, qword ptr [rdi + 4*rsi + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
当使用更窄的有符号类型时,仍然使用矢量化指令:
#include <cstddef>
auto foo(int* v, int start)
{
int sum = 0;
for (int i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, int):
movsx rsi, esi
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
; clang on x64 with -march=skylake
foo(int*, int): # @foo(int*, int)
movsxd rax, esi
vpbroadcastq xmm0, qword ptr [rdi + 4*rax + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
数值范围计算
The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as
int x = foo(); if (x > 0) { int y = x + 5; int z = y / 4;
it determines that x has the range
[1, INT_MAX]
after the if-statement, and can thus determine that y has the range[6, INT_MAX]
as overflow is not allowed. And the next line can be optimized toint z = y >> 2;
as the compiler knows that y is non-negative.
auto foo(int x)
{
if (x <= 0)
__builtin_unreachable();
return (x + 5) / 4;
}
foo(int):
lea eax, [rdi+5]
sar eax, 2
ret
The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form
[INT_MIN, (INT_MIN+4)]
or[6, INT_MAX]
that prevents all useful comparisons with<
or>
), such as
- Changing comparisons
x<y
to true or false if the ranges forx
andy
does not overlap- Changing
min(x,y)
ormax(x,y)
tox
ory
if the ranges do not overlap- Changing
abs(x)
tox
or-x
if the range does not cross0
- Changing
x/c
tox>>log2(c)
ifx>0
and the constantc
is a power of2
- Changing
x%c
tox&(c-1)
ifx>0
and the constantc
is a power of2
循环分析和优化
The canonical example of why undefined signed overflow helps loop optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate for undefined overflow. This helps architectures that have specific loop instructions, as they do in general not handle infinite loops.
But undefined signed overflow helps many more loop optimizations. All analysis such as determining number of iteration, transforming induction variables, and keeping track of memory accesses are using everything in the previous sections in order to do its work. In particular, the set of loops that can be vectorized are severely reduced when signed overflow is allowed.
这是一个真正的小基准,冒泡排序。我比较了时间 without/with -fwrapv
(这意味着溢出是 UB/not UB)。以下是结果(秒):
-O3 -O3 -fwrapv -O1 -O1 -fwrapv
Machine1, clang 5.2 6.3 6.8 7.7
Machine2, clang-8 4.2 7.8 6.4 6.7
Machine2, gcc-8 6.6 7.4 6.5 6.5
如您所见,not-UB (-fwrapv
) 版本几乎总是较慢,最大的差异相当大,1.85x。
这是代码。请注意,我有意选择了一个实现,它应该会为此测试产生更大的差异。
#include <stdio.h>
#include <stdlib.h>
void bubbleSort(int *a, long n) {
bool swapped;
for (int i = 0; i < n-1; i++) {
swapped = false;
for (int j = 0; j < n-i-1; j++) {
if (a[j] > a[j+1]) {
int t = a[j];
a[j] = a[j+1];
a[j+1] = t;
swapped = true;
}
}
if (!swapped) break;
}
}
int main() {
int a[8192];
for (int j=0; j<100; j++) {
for (int i=0; i<8192; i++) {
a[i] = rand();
}
bubbleSort(a, 8192);
}
}