内置mod('%') vs 自定义mod函数：提高modulus运算的性能

Question

最近我了解到 mod('%') 运算符非常慢。所以我做了一个函数，它会像 a%b 一样工作。但是它比 mod 运算符快吗？

这是我的函数

int mod(int a, int b)
{
    int tmp = a/b;
    return a - (b*tmp);
}

Answer 1

大多数时候，你的微优化代码不会打败编译器。我也不知道 "wisdom" 来自哪里，声称内置 % 很慢。它与机器能够计算的速度一样快 - 编译器可以为您进行所有微优化。

另请注意，对如此小的代码段进行性能测量并非易事。循环结构的条件或时间测量的抖动可能会影响您的结果。您可以找到一些人就此类问题进行的讨论，例如Andrei Alexantrescu 或 youtube 上的 Chandler Caruth。我曾经为我正在从事的项目编写了一个微型基准测试框架。确实有很多需要关心的事情，包括外部的东西，比如 OS 抢占你的线程，或者将它移动到另一个核心。

Answer 2

根据 Chandler Carruth's benchmarks at CppCon 2015，最快的模运算符 （在 x86 上，当使用 Clang 编译时） 是：

int fast_mod(const int input, const int ceil) {
    // apply the modulo operator only when needed
    // (i.e. when the input is greater than the ceiling)
    return input >= ceil ? input % ceil : input;
    // NB: the assumption here is that the numbers are positive
}

我建议你观看整个演讲，他详细介绍了为什么这种方法比无条件地使用 % 更快。

Answer 3

这可能取决于编译器和平台。

但我很感兴趣，在我的系统上，您的基准测试似乎是正确的。但是中的方法最快：

#include <chrono>
#include <iostream>

class Timer
{
    using clk = std::chrono::steady_clock;
    using microseconds = std::chrono::microseconds;

    clk::time_point tsb;
    clk::time_point tse;

public:

    void clear() { tsb = tse = clk::now(); }
    void start() { tsb = clk::now(); }
    void stop() { tse = clk::now(); }

    friend std::ostream& operator<<(std::ostream& o, const Timer& timer)
    {
        return o << timer.secs();
    }

    // return time difference in seconds
    double secs() const
    {
        if(tse <= tsb)
            return 0.0;
        auto d = std::chrono::duration_cast<microseconds>(tse - tsb);
        return d.count() / 1000000.0;
    }
};

int mod(int a, int b)
{
    int tmp=a/b;
    return a-(b*tmp);
}

int fast_mod(const int input, const int ceil) {
    // apply the modulo operator only when needed
    // (i.e. when the input is greater than the ceiling)
    return input < ceil ? input : input % ceil;
    // NB: the assumption here is that the numbers are positive
}

int main()
{
    auto N = 1000000000U;
    unsigned sum = 0;

    Timer timer;

    for(auto times = 0U; times < 3; ++times)
    {
        std::cout << "     run: " << (times + 1) << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += n % (N - n);
        timer.stop();

        std::cout << "       %: " << sum << " " << timer << "s" << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += mod(n, N - n);
        timer.stop();

        std::cout << "     mod: " << sum << " " << timer << "s" << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += fast_mod(n, N - n);
        timer.stop();

        std::cout << "fast_mod: " << sum << " " << timer << "s" << '\n';
    }
}

构建： GCC 5.1.1 (x86_64)

g++ -std=c++14 -march=native -O3 -g0 ...

输出：

     run: 1
       %: 3081207628 5.49396s
     mod: 3081207628 4.30814s
fast_mod: 3081207628 2.51296s
     run: 2
       %: 3081207628 5.5522s
     mod: 3081207628 4.25427s
fast_mod: 3081207628 2.52364s
     run: 3
       %: 3081207628 5.4947s
     mod: 3081207628 4.29646s
fast_mod: 3081207628 2.56916s

Answer 4

在程序员了解编译器不了解的操作数的情况下，程序员通常有可能击败余数运算的性能。例如，如果基数可能是 2 的幂，但是不是特别可能大于要减少的值，一个可以使用类似的东西：

unsigned mod(unsigned int x, unsigned int y)
{
  return y & (y-1) ? x % y : x & (y-1);
}

如果编译器对函数进行内联扩展并且基数是常量 2 的幂，编译器将用按位替换余数运算符 AND，这很容易成为一项重大改进。在基础不是的情况下一个恒定的二的幂，生成的代码需要做一点在选择是否使用余数运算符之前的计算，但是在基数恰好是 2 的幂的情况下，节省的成本按位与可能超过条件逻辑的成本。

自定义模数函数可能有用的另一种情况是基数是一个固定常量，编译器没有规定计算其余的。例如，如果要在平台上计算 x % 65521 它可以执行快速整数移位和乘法运算，可以观察到计算 x -= (x>>16)*65521; 会导致 x 小得多，但不会影响 x % 65521 的值。第二次执行该操作会将 x 减小到 0..65745 的范围——足够小以至于单个条件减法将产生正确的余数。

一些编译器可能知道如何使用此类技术以常量基有效地处理 % 运算符，但对于那些不知道的人来说，这种方法可能是一种有用的优化，尤其是在处理更大的数字时而不是机器字 [注意 65521 是 65536-15，因此在 16 位机器上可以将 x 计算为 x = (x & 65535) + 15*(x >> 16)。不如减去 65521 * (x >> 16) 的形式可读，但很容易看出它如何在 16 位机器上有效地处理。

Answer 5

只是为这次讨论贡献一点力量。如果要处理负数，使用下面的函数：

inline long long mod(const long long x, const long long y) {
    if (x >= y) {
        return x % y;
    } else if (x < 0) {
        return (x % y + y) % y;
    } else {
        return x;
    }
}

内置mod('%') vs 自定义mod函数：提高modulus运算的性能

Built-in mod ('%') vs custom mod function: improve the performance of modulus operation

c++

modular-arithmetic