Python: speed up pow(base,exp,mod) for fixed exp and mod, or with vectorization
The bottleneck of my code is repeated calls to pow(base, exponent, modulus) for very large integers (numpy does not support integers that large, around 100 to 256 bits).
However, my exponent and modulus are always the same. Can I somehow exploit that to speed up the computation with a custom function?
I tried defining a function as below (the function below is for a general modulus and exponent).
However, even when I hard-code every operation for the fixed exponent and modulus, without the while loop and if statements, it is still slower than pow.
def modular_pow(self, base, exponent, modulus):
    result = 1
    base = base % modulus
    while exponent > 0:
        if (exponent % 2 == 1):
            result = (result * base) % modulus
        exponent = exponent >> 1
        base = (base * base) % modulus
    return result
Another option would be if I could somehow "vectorize" it. I have to compute pow for about 100,000,000 different base values. Those values change frequently between runs of my script (so a lookup table is of no use), but I know all of them at the moment I hit run (I could compute them all at once).
Any ideas?
I got some speedup by using the mpz data type from gmpy2, but it is still too slow.
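(For reference, a minimal sketch of what I mean by the gmpy2 approach; the sample numbers are arbitrary:)

import gmpy2

# Arbitrary ~130-bit sample values, just to show the call shape.
base = gmpy2.mpz(2)**100 + 7
exponent = gmpy2.mpz(2)**127 - 1
modulus = gmpy2.mpz(2)**130 + 169
print(gmpy2.powmod(base, exponent, modulus))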
Looking at the wiki page, it seems your implementation is incorrect. Moving those two statements out of the else improves performance significantly.
Here is what I got from Wikipedia:
def modular_pow(base, exponent, modulus):
    if modulus == 1:
        return 0
    else:
        result = 1
        base = base % modulus
        while exponent > 0:
            if exponent % 2 == 1:
                result = (result * base) % modulus
            exponent = exponent >> 1
            base = (base * base) % modulus
        return result
Output:
print(modular_pow(4, 13, 497))
445
Good news, bad news. The good news is that when the modulus m is fixed, there are ways to speed computing a*b % m. Search for "Barrett reduction" and "Montgomery reduction". They work, in different ways, by precomputing constants related to m such that % m can be computed via multiplication and shifting, without needing division.
The bad news: to find the remainder, both ways require (in addition to cheaper operations) two multiplications. So they don't pay overall unless multiplication is much cheaper than division.
For that reason, they're generally slower unless the modulus is "truly" large - your "about 100 to 256 bits" is still on the small side by modern standards, just a few times wider than native 64-bit machine ints. Things like speedy FFT-based multiplication require much larger integers before they pay off.
CPython's built-in modular pow is already using a "binary" scheme, along the lines of what you coded in Python, but fancier (if the exponent is "large enough", the built-in pow views it as being in base 32, consuming 5 exponent bits per loop iteration).
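By way of illustration, here is a minimal Python sketch of that fixed-window idea (my own toy version, not CPython's actual C code; the helper name and the choice w=5, matching the 5-bits-per-iteration figure above, are mine):

def windowed_pow(base, exponent, modulus, w=5):
    """Sketch of a fixed-window (2**w-ary) modular pow.

    Precomputes base**0 .. base**(2**w - 1) mod modulus, then
    consumes w exponent bits per outer loop iteration.
    """
    table = [1] * (1 << w)
    base %= modulus
    for i in range(1, 1 << w):
        table[i] = (table[i - 1] * base) % modulus
    result = 1
    nbits = exponent.bit_length()
    # Round the bit count up to a multiple of w and scan top-down.
    for shift in range((nbits + w - 1) // w * w - w, -1, -w):
        for _ in range(w):
            result = (result * result) % modulus
        digit = (exponent >> shift) & ((1 << w) - 1)
        if digit:
            result = (result * table[digit]) % modulus
    return result

assert windowed_pow(345, 1198, 2347) == pow(345, 1198, 2347)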
In a quick stab implementing Montgomery reduction in Python, and replacing the modular multiplies in your code by the Montgomery spellings, modular_pow() didn't get faster than the built-in before the modulus grew to tens of thousands of bits. For inputs around 256 bits, it was about 3x slower.
That's a mixed bag: the Python code didn't exploit the "base 32" tricks, which can give substantial benefits. But for large enough inputs, CPython uses faster-than-naïve Karatsuba multiplication, which the division-free Montgomery spelling can benefit from (CPython's int division has no speedup tricks regardless of input sizes, and CPython's built-in modular pow always uses division to find remainders).
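For reference, a minimal scalar sketch of Montgomery reduction along those lines (standard textbook formulation; the names and parameter choices are mine, not the code I actually timed):

def montgomery_setup(m, s):
    """Precompute constants for Montgomery reduction mod odd m.

    s is the bit width of R = 2**s, chosen so that R > m.
    """
    r = 1 << s
    r_inv = pow(r, -1, m)          # requires Python 3.8+; m must be odd
    k = (r * r_inv - 1) // m       # satisfies r*r_inv - k*m == 1
    return r, k

def montgomery_reduce(x, m, s, k):
    # Computes x * R**-1 mod m using only mul, add, shift, mask.
    t = ((x & ((1 << s) - 1)) * k) & ((1 << s) - 1)
    u = (x + t * m) >> s           # exact: x + t*m is divisible by R
    return u - m if u >= m else u

m = 2347                           # small odd modulus for the demo
s = m.bit_length() + 2
r, k = montgomery_setup(m, s)
a, b = 345, 678
# Multiply in the Montgomery domain: (aR) * (bR) reduces to abR.
aR, bR = (a * r) % m, (b * r) % m
abR = montgomery_reduce(aR * bR, m, s, k)
assert montgomery_reduce(abR, m, s, k) == (a * b) % m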
So, short course: there's nothing I know of obvious you can do in CPython to speed a single instance of pow(a, b, c). It's possible some C-coded crypto library out there has something suitable, but not that I know of.
But the other good news is that your problem is "embarrassingly parallel". If you have N processors, you can give each of them 100000000/N of your inputs, and they can all run full speed in parallel. That would give a speedup of about a factor of N.
But the bad news is that your integers really aren't "large" (they're small enough that I bet you can still compute thousands of modular pows per second with the built-in pow), and interprocess communication costs may wipe out the benefits of doing N computations in parallel. That all depends on exactly how you get your inputs and what you want to do with the results.
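As a rough sketch of what that parallel split could look like (the fixed exponent and modulus shown are placeholders, and the chunking assumes the inputs fit in a plain list):

import multiprocessing as mp

EXPONENT = 0x10001          # hypothetical fixed exponent
MODULUS = (1 << 255) - 19   # hypothetical fixed modulus

def pow_chunk(chunk):
    # Each worker handles one slice of the bases.
    return [pow(b, EXPONENT, MODULUS) for b in chunk]

def pow_all(bases, nproc=None):
    nproc = nproc or mp.cpu_count()
    size = (len(bases) + nproc - 1) // nproc
    chunks = [bases[i:i + size] for i in range(0, len(bases), size)]
    with mp.Pool(nproc) as pool:
        return [r for part in pool.map(pow_chunk, chunks) for r in part]

if __name__ == '__main__':
    print(pow_all(list(range(2, 10)))[:3])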
FOLLOW UP
The "Handbook of Applied Cryptography" (HAC), chapter 14, essentially spells out the state of the art for gonzo modular exponentiation algorithms.
Looking into the code, GMP already implements every trick they have. This includes the things I mentioned (Montgomery reduction, and using a power-of-2 base higher than 2 to chew up more exponent bits per loop iteration). And others I didn't mention (e.g., GMP has a special internal routine for squaring, which saves cycles over a general product of possibly unequal integers). In all, it's a small mountain of implementation code.
I expect that's why you're not getting more answers: GMP is already doing, at worst, close to the best anyone has ever figured out how to do. The speedup for you isn't truly dramatic because, as already noted, the integers you're using are actually on the small side.
So if you need to pursue this, using GMP is likely the fastest you're going to get. As noted, multiprocessing is an obvious way to get a theoretical N-fold speedup with N processors, but as also noted you haven't said anything about the context (where these inputs come from or what you need to do with the outputs). So it's impossible to guess whether that might pay off for you. The more interprocess communication you need, the more that damages potential multiprocessing speedups.
Note: what you're doing is exactly what, e.g., RSA public-key cryptosystems do, although they generally use larger integers. That is, your "base" is their "message", and a public (or private) RSA key consists of a fixed exponent and fixed modulus. Only the base (message or encrypted bits) varies across en/de-cryption instances. For a given key, the exponent and modulus are always the same.
Many world-class mathematicians have studied the problem, and world-class hackers coded the algorithms for peak speed. That's why you should give up hope that there's a faster way that HAC just forgot to mention ;-)
Speculative
Drawing the connection to RSA reminded me: RSA decryption, in practice, does not proceed in "the obvious" way. Instead, the holder of the private key knows the prime factorization of the key's modulus (in RSA, the modulus is the product of two distinct - but kept secret - large primes), and that can be used to significantly speed exponentiation with respect to that modulus.
So (can't guess), if the way you get your modulus instances is such that you can compute their prime factorizations efficiently, that can be used to get significant speedups, when they're composite.
Not so much for a prime modulus, though. About the only highly potentially valuable trick then is that, for p prime and a not divisible by p,

pow(a, b, p) == pow(a, b % (p-1), p)

That can save unbounded time if b can be much larger than p. It works because, by Fermat's little theorem,

pow(a, p-1, p) == 1

for p prime and a not divisible by p. For example,
>>> p
2347
>>> assert all(p % i != 0 for i in range(2, p)) # that is, p is prime
>>> pow(345, 1000000000000000000000000000000, p)
301
>>> 1000000000000000000000000000000 % (p-1)
1198
>>> pow(345, 1198, p) # same thing, but much faster
301
For a composite modulus, much the same is done for each of its prime power factors, and then the results are pasted together via the Chinese remainder theorem.
If you think your problem can be worked to exploit that, search for "modular exponentiation Chinese remainder" to find a number of good expositions.
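For concreteness, here is a hedged sketch of that idea for a modulus with two known distinct prime factors p and q (the function name and sample primes are mine; this mirrors what RSA-CRT does):

def pow_crt(a, b, p, q):
    """Sketch: compute pow(a, b, p*q) given the factorization p*q.

    Uses Fermat's little theorem to shrink the exponent mod p-1 and
    q-1, then recombines the two half-size results via the Chinese
    remainder theorem. Assumes p, q are distinct primes and a is
    divisible by neither.
    """
    rp = pow(a, b % (p - 1), p)
    rq = pow(a, b % (q - 1), q)
    # CRT recombination: find x with x % p == rp and x % q == rq.
    q_inv = pow(q, -1, p)              # requires Python 3.8+
    h = ((rp - rq) * q_inv) % p
    return rq + h * q

p, q = 2347, 2351                      # both prime
assert pow_crt(345, 10**30, p, q) == pow(345, 10**30, p * q)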
You can precompute a^2, a^3, ..., a^(2^w - 1) using the windowed NAF method. Now, instead of n products and squarings, you have n/w rounds of products. For example, in a 256-bit modexp with w = 4, we make 6 precomputations, but we have 256/4 = 64 products instead of 256 squarings and products. At the cost of 14 precomputations, this is a substantial saving. Now, 4 bits means 2^4 = 16 possible values, but NAF represents these in the range -w+1..w-1. Note that negative digits require the modular inverse a^(-1), so using the encoding only over positive values is better than paying for an extra multiplication or having to compute a modular inverse. Note that a^0 and a^1 need not be computed.
Some of the precomputations can be optimized based on the NAF form of the exponent, since you know in advance exactly which values are needed.
The w value must be tuned to the exponent rather than to the size; the optimal value can be computed from n/w vs. 2^w - 1, or determined empirically.
I'm surprised the fixed-exponent aspect of the question has not been addressed yet; there are papers on exactly this topic. The same techniques are used in elliptic curve scalar multiplication, although there it is usually the point (analogous to the base) that is fixed, rather than the scalar (equivalent to the exponent). The same methods work for a fixed base, but then the precomputations can be done offline and reused more effectively, whereas with a fixed exponent they are done on the fly. A sketch of the digit recoding follows.
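To make the recoding concrete, here is a minimal sketch of width-w NAF digit generation (the helper name is mine; using these digits in a modexp additionally requires the modular inverse of the base for the negative digits):

def wnaf(e, w):
    """Sketch: width-w non-adjacent form of a non-negative exponent e.

    Yields digits (least significant first) that are either 0 or odd
    values in -(2**(w-1) - 1) .. 2**(w-1) - 1, with at most one
    nonzero digit in any w consecutive positions.
    """
    digits = []
    while e > 0:
        if e & 1:
            d = e % (1 << w)
            if d >= (1 << (w - 1)):
                d -= 1 << w          # map into the negative half-range
            e -= d
        else:
            d = 0
        digits.append(d)
        e >>= 1
    return digits

# The signed digits reconstruct the original exponent.
assert sum(d << i for i, d in enumerate(wnaf(1198, 4))) == 1198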
I decided to implement and compare several approaches in Python (without C/C++). Up front: following my code I reached a 25x speedup compared to Python's pow(a, b, c); some methods are parallel, some are not.
Note: see the updated part 2 of my post below, with C++ code.
See the code at the very bottom of this post. The timings below are for my i7 laptop, which has 4 cores at 1.2 GHz (8 hardware threads).
1. Python - pure Python, single process. This method uses the built-in pow(a, b, c) Python function in a list comprehension.
2. Python_Parallel - multiprocessing pure Python. Same as method 1) but parallel. This and the other parallel methods use the multiprocessing module to occupy all CPU cores.
3. Gmp - GMPY2-based single-process solution. This uses the powmod() function in a list comprehension.
4. Gmp_Parallel - parallel GMPY2-based solution. Uses method 3) but on all cores.
5. Redc_Div_Parallel - binary exponentiation (also known as exponentiation by squaring) with regular modulus computation through the % mod operation. This and the following methods also use Numpy arrays with Python objects (object dtype), and implement the Windowed Method to improve exponentiation.
6. Redc_Barrett_Parallel - same as 5) but uses Barrett Reduction to replace division with multiplication.
7. Redc_Montgomery_Parallel - same as 5) but uses Montgomery Reduction instead of division.
Timings:
Python : time 78.74 sec, boost 1.00
Python_Parallel : time 19.09 sec, boost 4.12, Correct
Gmp : time 27.76 sec, boost 2.84, Correct
Gmp_Parallel : time 7.57 sec, boost 10.40, Correct
Redc_Div_Parallel : time 17.38 sec, boost 4.53, Correct, w = 4 (324 redcs)
Redc_Barrett_Parallel : time 33.15 sec, boost 2.38, Correct, w = 4 (324 redcs)
Redc_Montgomery_Parallel : time 42.24 sec, boost 1.86, Correct, w = 4 (324 redcs)
All timings above show the boost relative to the Python (single-threaded) baseline. The correctness of all computed results is checked against the Python output.
You can see that Gmp_Parallel gives the biggest boost, and it is very simple: just two lines of code. Among Python-only solutions it is probably the best and simplest.
Of course, if you convert the Redc_... methods to C/C++ they should become faster, possibly faster than the GMP C function mpz_powm (which is the base of GMPY2's powmod()). You can use the fixed-width _ExtInt type available in Clang: just write using u512 = unsigned _ExtInt(512); and you get a fixed-size 512-bit integer type for free; you can even use any arbitrary bit width like _ExtInt(12345). Also, Boost has uint128_t/uint256_t/uint512_t available in C++.
The only reason I can imagine why the Barrett/Montgomery solutions are slower is that they are Python object-based, so most of the time is spent organizing the Python interaction rather than doing the plain mul/shift operations without division. Hence you should get a big speed gain if you use C++.
In my code below you can tweak the variables num_bits and pow_bits to set the size of the numbers and of the power (in bits), and cnt_nums to set the array size (the number of bases to be raised).
def PowModPython(bs, e, m):
    return [pow(b, e, m) for b in bs]

def PowModGmp(bs, e, m):
    import gmpy2, numpy as np
    return [gmpy2.powmod(b, e, m) for b in bs]
    #return list(np.vectorize(lambda x: gmpy2.powmod(x, e, m))(np.asarray(bs, dtype = object)))

def PowModParallel(func, bs, e, m, *, processes = None, extra = ()):
    import multiprocessing as mp, functools
    if processes is None:
        processes = mp.cpu_count()
    with mp.Pool(processes) as pool:
        try:
            block = (len(bs) + processes * 8 - 1) // (processes * 8)
            return functools.reduce(
                lambda a, b: a + b,
                pool.starmap(func,
                    [(bs[i : i + block], e, m) + extra for i in range(0, len(bs), block)]),
                []
            )
        finally:
            pool.close()
            pool.join()

def PowModRedc(bs, e, m, bits, redc_type = None, w = None):
    import gmpy2, numpy as np, math
    def Int(x):
        return gmpy2.mpz(x)
    def Mask(x, s):
        #return x & ((1 << s) - 1)
        return np.vectorize(lambda x: gmpy2.t_mod_2exp(x, s))(x)
    e = Int(e)
    m = Int(m)
    def CreateMod():
        if redc_type == 'div':
            def Mod(x):
                return x % m
            def To(x):
                return x
            def From(x):
                return x
            def Adjust(x):
                return x
        else:
            if redc_type is None and m & 1 != 0 or redc_type == 'montgomery':
                # https://www.nayuki.io/page/montgomery-reduction-algorithm
                assert m & 1 != 0, 'Montgomery can not handle even modulus!'
                def MontgomeryPrec(n, s):
                    n, s = Int(n), Int(s)
                    r = 1 << s
                    r2 = (r * r) % n
                    ri = pow(r, -1, n)
                    k = (r * ri - 1) // n
                    return n, s, Int(k), Int(r2), Int(r - 1)
                Mg = MontgomeryPrec(m, bits + 2)
                def Mod(x):
                    n, s, k, _, mask = Mg
                    t = Mask(Mask(x, s) * k, s)
                    u = (x + t * n) >> s
                    return u
                def To(x):
                    _, _, _, r2, _ = Mg
                    return Mod(x * r2)
                def From(x):
                    return Mod(x)
            elif redc_type == 'barrett':
                # https://www.nayuki.io/page/barrett-reduction-algorithm
                assert m > 0 and m & (m - 1) != 0, 'Modulus is a power of two! Not possible for Barrett.'
                def BarrettPrec(n, s):
                    n, s = Int(n), Int(s)
                    return n, s, (1 << s) // n
                Br = BarrettPrec(m, bits * 2 + 2)
                def Mod(x):
                    n, s, r = Br
                    q = (x * r) >> s
                    u = x - q * n
                    return u
                def To(x):
                    return x
                def From(x):
                    return x
            else:
                assert False, f'Unknown reduction type {redc_type}!'
            def Adjust(x):
                gm = x >= m
                x[gm] -= m
                return x
        return Mod, To, From, Adjust
    def Arr(x):
        return np.array([Int(e) for e in x], dtype = object)
    Mod, To, From, Adjust = CreateMod()
    bs = To(Mod(To(Arr(bs))))
    if w is None:
        w = OptW(e)[1]
    def PreComputePowers(d, w, bs):
        d = Int(d)
        digits = []
        mask = 2 ** w - 1
        while d != 0:
            digits.append(d & mask)
            d >>= w
        digits = digits[::-1]
        powers = [None, np.copy(bs)]
        for i in range(2, 2 ** w):
            powers.append(Mod(powers[-1] * bs))
        return digits, powers
    digits, powers = PreComputePowers(e, w, bs)
    res = powers[digits[0]]
    for digit in digits[1:]:
        for i in range(w):
            res = Mod(res * res)
        if digit != 0:
            res = Mod(res * powers[digit])
    return list(Adjust(From(res)))

def OptW(e):
    n = max(1, e.bit_length())
    minv = None
    for w in range(1, 11):
        # Number of multiplication-reductions
        c = (2 ** w - 2) + ((n - 1) // w) * (w + 1)
        if minv is None or c < minv[0]:
            minv = (c, w)
    return minv

def Test():
    import time, random, numpy as np
    random.seed(0)
    num_bits = 496
    pow_bits = 252
    cnt_nums = 1 << 16
    cnt_pows = 1 << 0
    def Rand(bits):
        return random.randrange(1 << bits)
    bs = [Rand(num_bits) for i in range(cnt_nums)]
    em = [(Rand(pow_bits), Rand(num_bits) | 1) for i in range(cnt_pows)]
    opt_w = OptW(em[0][0])
    ref_tim, res = None, None
    for fname, f, w in [
        ('Python', PowModPython, None),
        ('Python_Parallel', lambda bs, e, m: PowModParallel(PowModPython, bs, e, m), None),
        ('Gmp', PowModGmp, None),
        ('Gmp_Parallel', lambda bs, e, m: PowModParallel(PowModGmp, bs, e, m), None),
        ('Redc_Div_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'div')), opt_w),
        ('Redc_Barrett_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'barrett')), opt_w),
        ('Redc_Montgomery_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'montgomery')), opt_w),
    ]:
        tim = time.time()
        for e, m in em:
            resc = f(bs, e, m)
        tim = time.time() - tim
        if ref_tim is None:
            ref_tim = tim
        print(f'{fname:<26}: time {tim:>6.02f} sec, boost {ref_tim / tim:>5.02f}', end = '', flush = True)
        if res is not None:
            assert np.all(np.equal(res, resc))
            print(', Correct', end = '')
        else:
            res = resc
        print(f', w = {w[1]} ({w[0]} redcs)' if w is not None else '')

if __name__ == '__main__':
    Test()
Update. I also implemented a C++ version of the algorithms used above, together with Cython to adapt it to run in Python. See the code at the bottom of this post. I also used a Sliding Window approach instead of the Regular Window.
The C++ parallel version (Montgomery-based) appears to be 2x times faster than the GMPY2 parallel version implemented above (and 19x times faster than the Python single-threaded version). The Barrett version is 1.1x times faster than parallel GMP. Exact timings are below; see the Cython_... variants, they are the fastest:
reg_win = 4 (328 redcs), slide_win = 5 (320 redcs)
Python : time 78.90 sec, boost 1.00
Python_Parallel : time 20.22 sec, boost 3.90, Correct
Gmp : time 27.81 sec, boost 2.84, Correct
Gmp_Parallel : time 8.19 sec, boost 9.63, Correct
Redc_Div_Parallel : time 17.42 sec, boost 4.53, Correct
Redc_Barrett_Parallel : time 33.08 sec, boost 2.39, Correct
Redc_Montgomery_Parallel : time 43.05 sec, boost 1.83, Correct
Cython_Redc_Barrett_Parallel : time 7.71 sec, boost 10.23, Correct
Cython_Redc_Montgomery_Parallel : time 4.15 sec, boost 19.00, Correct
The C++ version uses features of the C++20 standard. It also requires the Clang compiler, because it uses the _ExtInt(N) type available only in Clang. For long arithmetic it uses Boost. For parallelization it uses the OpenMP technology built into the compiler.
The Cython package should be installed in Python, because Cython takes care of wrapping and compiling the C++. Install it via python -m pip install cython.
On Windows you can install Clang from the releases page: download and install LLVM-13.0.0-win64.exe. Boost can be installed by Chocolatey (a great package manager for Windows) or downloaded from the Boost website. To install it with Chocolatey, just type choco install boost-msvc-14.2 in a console. Also, on Windows you should install Visual Studio, the 2019 or 2022 version (the Community variant is enough), because Clang depends on its libraries on Windows; only on Linux/MacOS is Clang standalone.
Under Linux, Clang and Boost can be installed via sudo apt install clang libboost-all-dev.
In my code below you can see clang_dir = 'c:/bin/llvm/bin/', boost_dir = None inside the cython_import() function. You should set these two options only under Windows; for Linux set them to clang_dir = '', boost_dir = ''. Chocolatey installs Boost into the c:/local/boost_1_74_0/ directory by default.
Source is here. The full source code with the C++ turned out to be too big for the StackOverflow post size limit (30,000 characters), so I uploaded this code to PasteBin, where it can be downloaded from here (a Tio.run mirror is here; it is not runnable).
Update 2. I implemented one more version, also in C++/Cython. Now it is fully based on GMP (the C library); there is no Python interaction during computation.
This solution turned out to be very simple, and it is also the fastest. As a reminder, my fastest C++ solution above was the variant based on Montgomery Reduction and Sliding Window binary exponentiation; that solution is 19x times faster than the pure Python single-threaded variant.
The current C++/GMP-based solution is 25.4x times faster than single-threaded Python. Of course, it is parallelized using OpenMP. Its code is very simple, so due to the StackOverflow post size limit I provide only this single C++ function below; the full code is hosted on the external PasteBin web service.
Exact timings of all methods (the new method is called Cython_Gmp_Parallel):
reg_win = 4 (330 redcs), slide_win = 5 (321 redcs)
Python : time 162.81 sec, boost 1.00
Python_Parallel : time 37.92 sec, boost 4.29, Correct
Gmp : time 56.57 sec, boost 2.88, Correct
Gmp_Parallel : time 15.13 sec, boost 10.76, Correct
Redc_Div_Parallel : time 34.81 sec, boost 4.68, Correct
Redc_Barrett_Parallel : time 67.23 sec, boost 2.42, Correct
Redc_Montgomery_Parallel : time 86.69 sec, boost 1.88, Correct
Cython_Redc_Barrett_Parallel : time 13.96 sec, boost 11.66, Correct
Cython_Redc_Montgomery_Parallel : time 8.31 sec, boost 19.58, Correct
Cython_Gmp_Parallel : time 6.40 sec, boost 25.42, Correct
The full code, hosted on PasteBin, is here.
And the code of just the one C++ function, which is very short:
#include <gmpxx.h>
// .... And several other includes

using u64 = uint64_t; // alias assumed by the snippet (defined in the full source)

void PowModGmpC(
        uint64_t * bs0, uint64_t bs_words, uint64_t cnt, const uint64_t * e0,
        uint64_t e_words, const uint64_t * m0, uint64_t m_words) {
    // Import a little-endian array of 64-bit words as a GMP integer.
    auto ToL = [](void const * x, size_t c) {
        mpz_class r;
        mpz_import(r.get_mpz_t(), c / 8, -1, 8, 0, 0, x);
        return r;
    };
    // Export a GMP integer back into a 64-bit word array.
    auto ToW = [](mpz_class const & x, void * y, size_t c) {
        size_t cnt = 0;
        mpz_export(y, &cnt, -1, 8, 0, 0, x.get_mpz_t());
    };
    mpz_class const
        e = ToL(e0, e_words * sizeof(u64)),
        m = ToL(m0, m_words * sizeof(u64));
    std::span<u64> bs(bs0, bs_words * cnt);
    // Every base is independent, so the loop parallelizes trivially.
    #pragma omp parallel for
    for (size_t i = 0; i < cnt * bs_words; i += bs_words) {
        mpz_class b = ToL(&bs[i], bs_words * sizeof(u64)), r;
        mpz_powm(r.get_mpz_t(), b.get_mpz_t(), e.get_mpz_t(), m.get_mpz_t());
        ToW(r, &bs[i], bs_words * sizeof(u64));
    }
}