Numba 与 Cython 循环优化

Question

考虑以下四个函数（python、numba、cython 和 smart），它们在给定相同的整数输入时计算相同的响应

def python(n):
    total = 0
    for m in range(1,n+1):
        total += m
    return total

from numba import jit
numba = jit(python)

cpdef int cython(int n):
    cdef int total = 0
    cdef int m
    for m in range(1, n+1):
        total += m
    return total

def smart(n):
    return n * (n + 1) // 2

为他们的执行计时我有点惊讶地发现

numba的运行时间独立于n（而cython的时间在n中是线性的）
numba 比 smart

这立即提出了两个问题：

为什么 Numba 而不是 Cython 能够将其变成恒定时间算法？
鉴于 Numba 确实设法将其转变为恒定时间算法，为什么它比纯 Python 恒定时间函数慢 smart?

因为我不是汇编专家，所以查看生成的代码并没有给我太多线索，除此之外，Numba 生成的中间 LLVM 代码仍然出现（虽然我可能误解了）包含一个循环......我在最终生成的 x64 中绝望地迷失了。（除非有人问，否则我不会 post 生成的代码，因为它们相当长。）

我在 x64 Linux 上运行在 Jupyter notebook 中使用它，所以我怀疑 Cython 使用的是用于编译 Python 的 GCC 4.4.7 ;和 llvmlite 0.20.0，这意味着 LLVM 4.0.x.

编辑：

我也计时了

smart_numba = jit(smart)

和

cpdef int smart_cython(int n):
    return n * (n + 1) // 2

smart_numba 和 numba 给出相同的时间，比 smart（纯 Python）慢 25%比 smart_cython.

慢 175%

这是否表明 Cython 在有效跨越 Python/low-level 边界方面做得很好，而 Numba 做得很差？还是另有原因？

Answer 1

这似乎是 LLVM 与 GCC 的对比 - 请参阅编译器资源管理器中的示例 here, which is less noisy than what numba spits out. I get a bit lost in the assembly, but fairly clear that the GCC output has a loop (the jge to .L6) and the clang output does not. See also this issue 在 GCC bugtracker 上。
在我的机器上 (Windows x64) numba 并不明显比 smart 慢，只有大约 9 ns。这种开销似乎是由于 numba 的类型分派机制造成的——如果你通过选择一个特定的重载来省略它，numba 版本比 python one

这是我的时间安排

In [73]: %timeit numba_sum(10000)
182 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [74]: %timeit smart(10000)
171 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# pick out int64 overload
i64_numba_sum = numba_sum.get_overload((numba.int64,))

In [75]: %timeit i64_numba_sum(10000)
94 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Numba 与 Cython 循环优化

Numba vs Cython loop optimization

python

cython

numba

编辑：