Numba on pure python VS Numba on numpy-python

Question

Using numba is faster than using pure Python.

It seems established by now that numba on pure python is even (most of the time) faster than numpy-python, e.g. https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.

According to https://murillogroupmsu.com/julia-set-speed-comparison/ numba used on pure python code is faster than used on python code that uses numpy. Is that generally true and why?

It is explained why numba on pure python is faster than numpy-python: numba sees more code and has more ways to optimize the code than numpy, which only sees a small portion.

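Does this answer my question? Do I hinder numba from fully optimizing my code when using numpy, because numba is forced to use the numpy routines instead of finding an even more optimal way? I had hoped that numba would realise this and not use the numpy routines if that is not beneficial. Then it would use the numpy routines only where they are an improvement (after all, numpy is pretty well tested). After all, "Support for NumPy arrays is a key focus of Numba development and is currently undergoing extensive refactorization and improvement."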

Answer 1

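Before answering the specific questions, let's clear a few things up first:

For this answer I will only consider nopython code; object-mode code is often slower than the pure Python/NumPy equivalents.
I will ignore numba's GPU capabilities for this answer - it is difficult to compare code running on the GPU with code running on the CPU.
When you call a NumPy function inside a numba function, you are not really calling a NumPy function. Everything that numba supports is re-implemented in numba. This applies to NumPy functions but also to Python data types in numba! So the implementation details of Python/NumPy inside a numba function and outside it may differ, because they are totally different functions/types.
Numba generates code that is compiled with LLVM. Numba is not magic; it is just a wrapper around an optimizing compiler, with some optimizations built into numba!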

It seems established by now, that numba on pure python is even (most of the time) faster than numpy-python

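No. Numba is often slower than NumPy. It depends on what operation you want to do and how you do it. If you handle very small arrays, or if the only alternative would be to manually iterate over the array, then numba is indeed faster.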

numba used on pure python code is faster than used on python code that uses numpy. Is that generally true and why?

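That depends on the code - there are probably more cases where NumPy beats numba. However, the trick is to apply numba where there is no corresponding NumPy function, where you would need to chain lots of NumPy functions, or where the available NumPy functions are not ideal. The trick is to know when a numba implementation might be faster; then it is best not to use NumPy functions inside numba, because you would get all the drawbacks of a NumPy function. However, it takes experience to know when and how to apply numba - it is easy to accidentally write a very slow numba function.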
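As a small illustration of a case "with no corresponding NumPy function", here is a sketch that finds the first index whose value exceeds a threshold. The function names are made up for this example, and the try/except fallback is only an assumption so that the snippet still runs (slowly) when numba is not installed. The NumPy variant has to build a full boolean array and scan all of it, while the loop can stop at the first hit. Whether the loop is actually faster depends on the data, so measure before relying on it.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba the loop runs as plain Python.
    def njit(func):
        return func


@njit
def first_index_above(values, threshold):
    # Hand-written loop: exits at the first hit, creates no temporary array.
    for i in range(values.shape[0]):
        if values[i] > threshold:
            return i
    return -1


def first_index_above_np(values, threshold):
    # NumPy variant: one full pass to build the mask, then more passes to search it.
    mask = values > threshold
    if not mask.any():
        return -1
    return int(np.argmax(mask))


values = np.linspace(0.0, 1.0, 1_000_001)
print(first_index_above(values, 0.5) == first_index_above_np(values, 0.5))
```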

Do I hinder numba to fully optimize my code when using numpy, because numba is forced to use the numpy routines instead of finding an even more optimal way?

Yes.

I had hoped that numba would realise this and not use the numpy routines if it is non-beneficial.

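No, that is not how numba works at the moment. Numba just creates code for LLVM to compile. Maybe that is a feature numba will have in the future (who knows). Right now numba performs best if you write the loops and operations yourself and avoid calling NumPy functions inside numba functions.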

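There are a few libraries that use expression trees and might optimize non-beneficial NumPy function calls - but these typically don't allow fast manual iteration. For example, numexpr can optimize multiple chained NumPy function calls. At the moment it is either fast manual iteration (cython/numba) or optimizing chained NumPy calls using expression trees (numexpr). Maybe it is not even possible to do both inside one library - I don't know.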

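Numba and Cython are great when it comes to small arrays and fast manual iteration over arrays. NumPy/SciPy are great because they come with a whole lot of sophisticated functionality that handles all sorts of tasks out of the box. Numexpr is great for chaining multiple NumPy function calls. In some cases Python is faster than any of these tools.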

In my experience you can get the most out of the different tools if you combine them. Don't limit yourself to just one tool.

Answer 2

According to https://murillogroupmsu.com/julia-set-speed-comparison/ numba used on pure python code is faster than used on python code that uses numpy. Is that generally true and why?

In it is explained why numba on pure python is faster than numpy-python: numba sees more code and has more ways to optimize the code than numpy which only sees a small portion.

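Numba just replaces numpy functions with its own implementations. They can be faster or slower, and the results can also differ. The real issue is the mechanism by which this replacement happens. Quite often unnecessary temporary arrays and loops are involved, which could be fused.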

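Loop fusing and removing temporary arrays is not an easy task. The behaviour also differs depending on whether you compile for the parallel target, which is a lot better at loop fusing, or for a single-threaded target.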
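To make "temporary arrays" and "fused loops" concrete, here is a NumPy-only sketch (the names unfused and fused are just for illustration). The first function spells out roughly what a chained array expression amounts to when each operation is evaluated on its own; the second does everything in a single pass. In plain Python the single-pass loop is of course slow; it only pays off once a compiler such as numba turns it into machine code.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = (rng.random(1_000) for _ in range(4))


def unfused(a, b, c, d):
    # Each line is a full pass over the data and allocates a new array.
    t1 = b * 2.0
    t2 = c * 3.0
    t3 = d * 4.0
    t4 = a + t1
    t5 = t4 + t2
    t6 = t5 + t3
    return np.sum(t6)


def fused(a, b, c, d):
    # One pass over the data, no temporary arrays, plain left-to-right summation.
    res = 0.0
    for i in range(a.shape[0]):
        res += a[i] + b[i] * 2.0 + c[i] * 3.0 + d[i] * 4.0
    return res


expr = np.sum(a + b * 2.0 + c * 3.0 + d * 4.0)
print(unfused(a, b, c, d), fused(a, b, c, d), expr)
```

The fused result can differ from the other two in the last digits, because the order of the additions in the final sum is not the same.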

[Edit] The optimizations described in Section 1.10.4, Diagnostics (like loop fusing), which are done in the parallel accelerator, can also be enabled in single-threaded mode by setting parallel=True and nb.parfor.sequential_parfor_lowering = True.

Example

#only for single-threaded numpy test
import os
os.environ["OMP_NUM_THREADS"] = "1"

import numba as nb
import numpy as np

a=np.random.rand(100_000_000)
b=np.random.rand(100_000_000)
c=np.random.rand(100_000_000)
d=np.random.rand(100_000_000)

#Numpy version
#every expression is evaluated on its own 
#the summation algorithm (Pairwise summation) isn't equivalent to the algorithm I used below
def Test_np(a,b,c,d):
    return np.sum(a+b*2.+c*3.+d*4.)

#The same code, but for Numba (results and performance differ)
@nb.njit(fastmath=False,parallel=True)
def Test_np_nb(a,b,c,d):
    return np.sum(a+b*2.+c*3.+d*4.)

#the summation isn't fused; approximately the behaviour of Test_np_nb for the
#single-threaded target
@nb.njit(fastmath=False,parallel=True)
def Test_np_nb_eq(a,b,c,d):
    TMP=np.empty(a.shape[0])
    for i in nb.prange(a.shape[0]):
        TMP[i]=a[i]+b[i]*2.+c[i]*3.+d[i]*4.

    res=0.
    for i in nb.prange(a.shape[0]):
        res+=TMP[i]

    return res

#The usual way someone would implement this in Numba
@nb.njit(fastmath=False,parallel=True)
def Test_nb(a,b,c,d):
    res=0.
    for i in nb.prange(a.shape[0]):
        res+=a[i]+b[i]*2.+c[i]*3.+d[i]*4.
    return res

Timings

#single-threaded
%timeit res_1=Test_nb(a,b,c,d)
178 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=Test_np(a,b,c,d)
2.72 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=Test_np_nb(a,b,c,d)
562 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
612 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#single-threaded
#parallel=True
#nb.parfor.sequential_parfor_lowering = True
%timeit res_1=Test_nb(a,b,c,d)
188 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=Test_np_nb(a,b,c,d)
184 ms ± 817 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
185 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#multi-threaded
%timeit res_1=Test_nb(a,b,c,d)
105 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=Test_np(a,b,c,d)
1.78 s ± 75.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=Test_np_nb(a,b,c,d)
102 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
102 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Results

#single-threaded
res_1=Test_nb(a,b,c,d)
499977967.27572954
res_2=Test_np(a,b,c,d)
499977967.2756622
res_3=Test_np_nb(a,b,c,d)
499977967.2756614
res_4=Test_np_nb_eq(a,b,c,d)
499977967.2756614

#multi-threaded
res_1=Test_nb(a,b,c,d)
499977967.27572465
res_2=Test_np(a,b,c,d)
499977967.2756622
res_3=Test_np_nb(a,b,c,d)
499977967.27572465
res_4=Test_np_nb_eq(a,b,c,d)
499977967.27572465

Conclusion

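Which approach is best depends on the use case. Some algorithms can easily be written in a few lines of Numpy; other algorithms are hard or impossible to implement in a vectorized fashion.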

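I also deliberately used a summation example here. Doing it all at once is easy to code and a lot faster, but if I want the most precise result I would definitely use a more sophisticated algorithm, such as the one already implemented in Numpy. Of course you can do the same in Numba, but that would be more work.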
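To show why the summation algorithm matters for precision, here is a textbook sketch of pairwise summation next to a plain running sum, both in float32 so the effect is easy to see. This is not NumPy's actual implementation (which works on blocks and is written in C); the helper names and the test data are assumptions chosen for illustration.

```python
import numpy as np


def naive_sum(x):
    # Plain running sum: rounding errors pile up as the total grows.
    total = np.float32(0.0)
    for v in x:
        total = total + v
    return total


def pairwise_sum(x):
    # Textbook pairwise summation: split in half, sum each half, add the results.
    n = x.shape[0]
    if n == 0:
        return np.float32(0.0)
    if n == 1:
        return x[0]
    mid = n // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])


x = np.full(2**16, 0.1, dtype=np.float32)
# Reference value: multiplying by a power of two is exact in float64.
exact = float(x[0]) * x.shape[0]

naive_err = abs(float(naive_sum(x)) - exact)
pairwise_err = abs(float(pairwise_sum(x)) - exact)
print(naive_err, pairwise_err)
```

A hand-written Numba loop like Test_nb above behaves like naive_sum in this respect, which is one reason its result differs slightly from Test_np.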