Vectorization in a loop slower than a nested loop in numba jitted function

I have been experimenting with the performance gains from combining vectorization and for loops inside numba @njit-compiled functions (I am currently using numba 0.45.1). Disappointingly, I found that in my code it is actually slower than the pure nested-loop implementation.

Here is my code:

import numpy as np
from numba import njit

@njit
def func3(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)

    result = np.full((n, win_len), np.nan)

    alpha_arr = 2 / (win_arr + 1)

    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)

    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[0, :two_index] = arr_in[0]

    for i in range(1, n):
        w = w + (1-alpha_arr)**i
        e = e*(1-alpha_arr) + arr_in[i]
        result[i,:] = e /w

    return result

@njit
def func4(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)

    result = np.full((n, win_len), np.nan)

    alpha_arr = 2 / (win_arr + 1)

    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)

    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[0, :two_index] = arr_in[0]

    for i in range(1, n):
        for col in range(len(win_arr)):
            w[col] = w[col] + (1-alpha_arr[col])**i
            e[col] = e[col]*(1-alpha_arr[col]) + arr_in[i]
            result[i,col] = e[col] /w[col]

    return result

if __name__ == '__main__':
    np.random.seed(0)
    data_size = 200000
    winarr_size = 1000

    data = np.random.uniform(0,1000, size = data_size)+29000
    win_array = np.arange(1, winarr_size+1)

    abc_test3= func3(data, win_array)
    abc_test4= func4(data, win_array)

    print(np.allclose(abc_test3, abc_test4, equal_nan = True))

I benchmarked the two functions with the following configurations:

(data_size, winarr_size) = (200000, 100), (200000, 200), (200000, 1000), (200000, 2000), (20000, 10000), (2000, 100000).

and found that the pure nested for-loop implementation (func4) is consistently faster (by roughly 2-5%) than the implementation that mixes a for loop with vectorization (func3).


My questions are the following:

1) What else can be changed to further speed up the code?

2) Why does the computation time of the vectorized version of the function grow linearly with the size of win_arr? I thought vectorization should make the operation speed constant no matter how big/small the vector is, but apparently this does not hold true in this case.

3) Are there any general conditions under which the computation time of a vectorized operation will still grow linearly with the input size?

It seems you have misunderstood what "vectorized" means. Vectorization means that you write code that operates on arrays as if they were scalars — but that is only what the code looks like; it says nothing about performance.

In the Python/NumPy world, vectorization also implies that the loop overhead of a vectorized operation is (generally) much smaller than that of explicitly looped code. However, vectorized code still has to execute a loop (it is just hidden inside the library)!

Furthermore, if you write a loop with numba, numba compiles it and produces code that (generally) runs as fast as vectorized NumPy code. That means that inside a numba function there is no significant performance difference between vectorized and non-vectorized code.

So this should answer your questions:

2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.

It grows linearly because it still has to iterate. In vectorized code the loop is merely hidden inside a library routine.
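A rough pure-NumPy illustration of this point: even a "vectorized" reduction like sum still visits every element, so a much larger array takes correspondingly longer.

```python
import timeit
import numpy as np

small = np.random.rand(10_000)
large = np.random.rand(10_000_000)  # 1000x more elements

# Take the best of several runs to reduce timing noise.
t_small = min(timeit.repeat(lambda: small.sum(), number=10, repeat=5))
t_large = min(timeit.repeat(lambda: large.sum(), number=10, repeat=5))

print(t_large > t_small)  # the bigger array takes longer
print(t_large / t_small)  # many times larger, despite being "vectorized"
```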

3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?

No.


You also asked what could be done to make it faster.

As already mentioned in the comments, it can be parallelized:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def func6(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)

    result = np.full((n, win_len), np.nan)

    alpha_arr = 2 / (win_arr + 1)

    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)

    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[0, :two_index] = arr_in[0]

    for i in range(1, n):
        for col in nb.prange(len(win_arr)):
            w[col] = w[col] + (1-alpha_arr[col])**i
            e[col] = e[col] * (1-alpha_arr[col]) + arr_in[i]
            result[i,col] = e[col] /w[col]

    return result

This makes the code a bit faster on my machine (4 cores).

However, there is another problem: your algorithm may be numerically unstable. Raising (1-alpha_arr[col]) to powers on the order of one hundred thousand will underflow at some point:

>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
...     print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0  # <-- underflow
0.0

Always think twice about expensive math operations like pow, division, etc. If you can replace them with simple operations like multiplication, addition and subtraction, it is always worth a try.

Please note that repeatedly multiplying alpha with itself is only algebraically identical to computing the power directly. Since this is numerical math, the results can differ.
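A small sketch of that caveat (the window value 100 is just an example): accumulating `acc *= alpha` step by step is algebraically the same as `alpha ** i`, but floating-point rounding can make the two results differ slightly.

```python
import numpy as np

alpha = 1.0 - 2.0 / (100.0 + 1.0)  # e.g. a window length of 100

# Running product, as in the loop-based implementation.
acc = 1.0
powers = []
for i in range(1, 200):
    acc *= alpha
    powers.append(acc)

# Direct exponentiation for comparison.
direct = [alpha ** i for i in range(1, 200)]

max_rel_err = max(abs(a - d) / d for a, d in zip(powers, direct))
print(max_rel_err)  # tiny, but not necessarily exactly zero
```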

Also avoid unnecessary temporary arrays.
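For example (a minimal sketch, not taken from the code above): `w = w + x` allocates a fresh temporary array on every iteration, while `w += x` updates the existing buffer in place — and the scalar rewrite below avoids per-element temporaries entirely.

```python
import numpy as np

w1 = np.ones(5)
w2 = np.ones(5)
x = np.arange(5, dtype=np.float64)

buf_before = w2.ctypes.data  # address of w2's data buffer

w1 = w1 + x   # allocates a new array and rebinds w1
w2 += x       # writes into the existing buffer

print(np.allclose(w1, w2))           # same values...
print(w2.ctypes.data == buf_before)  # ...but w2 kept its original buffer
```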

First attempt:

@nb.njit(error_model="numpy",parallel=True)
def func5(arr_in, win_arr):
    #filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0],arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col,0]=np.nan

    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[:two_index,0] = arr_in[0]

    for col in nb.prange(win_arr.shape[0]):
        alpha=1.-(2./ (win_arr[col] + 1.))
        alpha_exp=alpha

        w=1.
        e=arr_in[0]

        for i in range(1, arr_in.shape[0]):
            w+= alpha_exp
            e = e*alpha + arr_in[i]
            result[col,i] = e/w
            alpha_exp*=alpha

    return result.T

Second attempt (avoids the underflow):

@nb.njit(error_model="numpy",parallel=True)
def func7(arr_in, win_arr):
    #filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0],arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col,0]=np.nan

    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[:two_index,0] = arr_in[0]

    for col in nb.prange(win_arr.shape[0]):
        alpha=1.-(2./ (win_arr[col] + 1.))
        alpha_exp=alpha

        w=1.
        e=arr_in[0]

        for i in range(1, arr_in.shape[0]):
            w+= alpha_exp
            e = e*alpha + arr_in[i]
            result[col,i] = e/w

            if np.abs(alpha_exp) >= 1e-308:
                alpha_exp *= alpha
            else:
                alpha_exp = 0.

    return result.T
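A side note on the 1e-308 threshold used in the guard: it sits just below the smallest normal double (np.finfo(np.float64).tiny, about 2.2e-308). Once alpha_exp drops below it, the term no longer contributes to w anyway, so clamping it to zero is safe and avoids slow subnormal arithmetic. A minimal scalar sketch of the same guard (alpha = 0.99 is just an example value):

```python
import numpy as np

print(np.finfo(np.float64).tiny)  # smallest normal float64, ~2.2e-308

alpha = 0.99
alpha_exp = alpha
w = 1.0
for i in range(1, 200_000):
    w += alpha_exp
    if np.abs(alpha_exp) >= 1e-308:
        alpha_exp *= alpha
    else:
        alpha_exp = 0.0  # clamp instead of drifting through subnormals

# The guarded sum converges to the geometric-series limit 1/(1 - alpha).
print(w)
```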

Timings:

%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)