是否有一些有意义的统计数据来证明保持有符号整数算术溢出未定义是合理的？

Question

C 标准明确指定有符号整数溢出具有未定义的行为。然而，大多数 CPU 都为溢出实现了具有定义语义的带符号算术（除法溢出可能除外：x / 0 和 INT_MIN / -1）。

编译器编写者一直在利用此类溢出的 未定义性 来添加更积极的优化，这些优化往往会以非常微妙的方式破坏遗留代码。例如，此代码可能适用于较旧的编译器，但不再适用于当前版本的 gcc 和 clang:

/* Tncrement a by a value in 0..255, clamp a to positive integers.
   The code relies on 32-bit wrap-around, but the C Standard makes
   signed integer overflow undefined behavior, so sum_max can now 
   return values less than a. There are Standard compliant ways to
   implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
    int res = a + b;
    return (res >= a) ? res : INT_MAX;
}

是否有确凿证据表明这些优化是值得的？是否有比较研究记录真实生活示例甚至经典基准的实际改进？

我在看这个的时候想到了这个问题：C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”

我正在标记 c 和 c++ 因为这两种语言的问题相似但答案可能不同。

Answer 1

不完全是优化的例子，但未定义行为的一个有用结果是 -ftrapv GCC/clang 的命令行开关。它插入的代码会使您的程序在整数溢出时崩溃。

根据无符号溢出是故意的想法，它不适用于无符号整数。

标准对有符号整数溢出的措辞确保人们不会故意编写溢出代码，因此ftrapv是发现无意溢出的有用工具。

Answer 2

答案实际上在你的问题中：

Yet most CPUs implement signed arithmetics with defined semantics

我想不出你今天可以买到的 CPU 不使用 twos-compliment 算法来计算有符号整数的，但情况并非总是如此。

C语言是1972年发明的，那时候还有IBM 7090大型机。并非所有计算机都是 twos-compliment.

围绕 2s-compliment 定义语言（和溢出行为）将不利于在没有的机器上生成代码。

此外，正如已经说过的，指定有符号溢出是 UB 允许编译器生成更好的代码，因为它可以折扣由有符号溢出导致的代码路径，假设这永远不会发生。

如果我理解正确，它的目的是将 a 和 b 的总和限制为 0....INT_MAX，我可以想出两种方法来以兼容的方式编写此函数。

首先，适用于所有 CPU 的低效一般情况：

int sum_max(int a, unsigned char b) {
    if (a > std::numeric_limits<int>::max() - b)
        return std::numeric_limits<int>::max();
    else
        return a + b;
}

二、惊人高效的2s-compliment具体方式：

int sum_max2(int a, unsigned char b) {
    unsigned int buffer;
    std::memcpy(&buffer, &a, sizeof(a));
    buffer += b;
    if (buffer > std::numeric_limits<int>::max())
        buffer = std::numeric_limits<int>::max();
    std::memcpy(&a, &buffer, sizeof(a));
    return a;
}

生成的汇编器可以在这里看到：https://godbolt.org/z/F42IXV

Answer 3

我不知道研究和统计数据，但是是的，考虑到编译器实际做的，肯定有优化。是的，它们非常重要（例如 tldr 循环矢量化）。

除了编译器优化之外，还有一个方面需要考虑。使用 UB，您可以获得 C/C++ 有符号整数，其算术行为与您期望的数学行为相同。例如 x + 10 > x 现在成立（当然对于有效代码），但不会出现 wrap-around 行为。

我从 Krister Walfridsson 的博客中找到了一篇很棒的文章 How undefined signed overflow enables optimizations in GCC，其中列出了一些将签名溢出 UB 考虑在内的优化。下面的例子都来自它。我正在向它们添加 C++ 和汇编示例。

如果优化看起来过于简单、无趣或没有影响，请记住，这些优化只是更大的优化链中的步骤。蝴蝶效应确实发生了，因为在较早的步骤中看似不重要的优化可能会在后面的步骤中触发更有影响力的优化。

如果这些示例看起来毫无意义（谁会写 x * 10 > 0）请记住，您可以使用常量、宏和模板轻松地使用 C 和 C++ 中的此类示例。此外，编译器在其 IR 中应用转换和优化时可以获得此类示例。

有符号整数表达式简化

与0比较消去乘法

(x * c) cmp 0   ->   x cmp 0

bool foo(int x) { return x * 10 > 0 }

foo(int):
        test    edi, edi
        setg    al
        ret

乘完除法

(x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2

int foo(int x) { return (x * 20) / 10; }

foo(int):
        lea     eax, [rdi+rdi]
        ret

消除否定

(-x) / (-y) -> x / y

int foo(int x, int y) { return (-x) / (-y); }

foo(int, int):
        mov     eax, edi
        cdq
        idiv    esi
        ret

简化始终为真或假的比较

x + c < x       ->   false
x + c <= x      ->   false
x + c > x       ->   true
x + c >= x      ->   true

bool foo(int x) { return x + 10 >= x; }

foo(int):
        mov     eax, 1
        ret

消除比较中的否定

(-x) cmp (-y)   ->   y cmp x

bool foo(int x, int y) { return -x < -y; }

foo(int, int):
        cmp     edi, esi
        setg    al
        ret

减小常量的大小

x + c > y       ->   x + (c - 1) >= y
x + c <= y      ->   x + (c - 1) < y

bool foo(int x, int y) { return x + 10 <= y; }

foo(int, int):
        add     edi, 9
        cmp     edi, esi
        setl    al
        ret

消除比较中的常量

(x + c1) cmp c2         ->   x cmp (c2 - c1)
(x + c1) cmp (y + c2)   ->   x cmp (y + (c2 - c1)) if c1 <= c2

The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.

bool foo(int x) { return x + 42 <= 11; }

foo(int):
        cmp     edi, -30
        setl    al
        ret

指针运算和类型提升

If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.

One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.

这是一个非常重要的优化，因为循环矢量化是最有效的优化算法之一。

这是将索引从无符号索引更改为有符号索引改进生成的程序集的示例：

未签名版

#include <cstddef>

auto foo(int* v, std::size_t start)
{
    int sum = 0;

    for (std::size_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

对于未签名的情况，必须考虑 start + 4 环绕的情况，并生成一个分支来处理这种情况（分支对性能不利）：

; gcc on x64 with -march=skylake

foo1(int*, unsigned long):
        cmp     rsi, -5
        ja      .L3
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
.L3:
        xor     eax, eax
        ret

; clang on x64 with -march=skylake

foo1(int*, unsigned long):                             # @foo1(int*, unsigned long)
        xor     eax, eax
        cmp     rsi, -4
        jae     .LBB0_2
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
.LBB0_2:
        ret

作为旁注，使用更窄的类型会导致更糟糕的汇编，从而禁止使用 SSE 向量化指令：

#include <cstddef>

auto foo(int* v, unsigned start)
{
    int sum = 0;

    for (unsigned i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

; gcc on x64 with -march=skylake

foo(int*, unsigned int):
        cmp     esi, -5
        ja      .L3
        mov     eax, esi
        mov     eax, DWORD PTR [rdi+rax*4]
        lea     edx, [rsi+1]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+2]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+3]
        add     eax, DWORD PTR [rdi+rdx*4]
        ret
.L3:
        xor     eax, eax
        ret

; clang on x64 with -march=skylake

foo(int*, unsigned int):                              # @foo(int*, unsigned int)
        xor     eax, eax
        cmp     esi, -5
        ja      .LBB0_3
        mov     ecx, esi
        add     esi, 4
        mov     eax, dword ptr [rdi + 4*rcx]
        lea     rdx, [rcx + 1]
        cmp     rdx, rsi
        jae     .LBB0_3
        add     eax, dword ptr [rdi + 4*rcx + 4]
        add     eax, dword ptr [rdi + 4*rcx + 8]
        add     eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
        ret

签名版

然而，使用带符号的索引会产生漂亮的矢量化无分支代码：

#include <cstddef>

auto foo(int* v, std::ptrdiff_t start)
{
    int sum = 0;

    for (std::ptrdiff_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

; gcc on x64 with -march=skylake

foo(int*, long):
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

; clang on x64 with -march=skylake

foo(int*, long):                              # @foo(int*, long)
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

当使用更窄的有符号类型时，仍然使用矢量化指令：

#include <cstddef>

auto foo(int* v, int start)
{
    int sum = 0;

    for (int i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

; gcc on x64 with -march=skylake

foo(int*, int):
        movsx   rsi, esi
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

; clang on x64 with -march=skylake

foo(int*, int):                              # @foo(int*, int)
        movsxd  rax, esi
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rax + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

数值范围计算

The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as
int x = foo();
if (x > 0) {
  int y = x + 5;
  int z = y / 4;
it determines that x has the range [1, INT_MAX] after the if-statement, and can thus determine that y has the range [6, INT_MAX] as overflow is not allowed. And the next line can be optimized to int z = y >> 2; as the compiler knows that y is non-negative.

auto foo(int x)
{
    if (x <= 0)
        __builtin_unreachable();
    
    return (x + 5) / 4;
}

foo(int):
        lea     eax, [rdi+5]
        sar     eax, 2
        ret

The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX] that prevents all useful comparisons with < or >), such as

Changing comparisons x<y to true or false if the ranges for x and y does not overlap

Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap

Changing abs(x) to x or -x if the range does not cross 0

Changing x/c to x>>log2(c) if x>0 and the constant c is a power of 2

Changing x%c to x&(c-1) if x>0 and the constant c is a power of 2

循环分析和优化

The canonical example of why undefined signed overflow helps loop optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate for undefined overflow. This helps architectures that have specific loop instructions, as they do in general not handle infinite loops.

But undefined signed overflow helps many more loop optimizations. All analysis such as determining number of iteration, transforming induction variables, and keeping track of memory accesses are using everything in the previous sections in order to do its work. In particular, the set of loops that can be vectorized are severely reduced when signed overflow is allowed.

Answer 4

这是一个真正的小基准，冒泡排序。我比较了时间 without/with -fwrapv（这意味着溢出是 UB/not UB）。以下是结果（秒）：

                   -O3     -O3 -fwrapv    -O1     -O1 -fwrapv
Machine1, clang    5.2     6.3            6.8     7.7
Machine2, clang-8  4.2     7.8            6.4     6.7
Machine2, gcc-8    6.6     7.4            6.5     6.5

如您所见，not-UB (-fwrapv) 版本几乎总是较慢，最大的差异相当大，1.85x。

这是代码。请注意，我有意选择了一个实现，它应该会为此测试产生更大的差异。

#include <stdio.h>
#include <stdlib.h>

void bubbleSort(int *a, long n) {
        bool swapped;
        for (int i = 0; i < n-1; i++) {
                swapped = false;
                for (int j = 0; j < n-i-1; j++) {
                        if (a[j] > a[j+1]) {
                                int t = a[j];
                                a[j] = a[j+1];
                                a[j+1] = t;
                                swapped = true;
                        }
                }

                if (!swapped) break;
        }
}

int main() {
        int a[8192];

        for (int j=0; j<100; j++) {
                for (int i=0; i<8192; i++) {
                        a[i] = rand();
                }

                bubbleSort(a, 8192);
        }
}

是否有一些有意义的统计数据来证明保持有符号整数算术溢出未定义是合理的？

Is there some meaningful statistical data to justify keeping signed integer arithmetic overflow undefined?

c

c++

signed

integer-overflow

language-lawyer

有符号整数表达式简化

指针运算和类型提升

未签名版

签名版

数值范围计算

循环分析和优化