Compiled Numba function not faster than CPython

Question

I have a function compiled with Numba that splits an array at given indexes, which returns a ragged (variable-length) list of numpy arrays. That list is then padded to form a 2D array.

Problem

The compiled function 'nb_array2mat' should be much faster than the pure Python 'array2mat', but it is not.

Also, is this feasible with numpy alone?

length of the array and index    
1456391 95007

times:
numba:  1.3438396453857422    
python:  1.1407015323638916

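I think I am not using Numba compilation the right way. Any help would be great.

Edit

Now, using the dummy data edited into the code section, I do get a speedup. Why does it not work with the actual data?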

 length of the array and index    
 1456391 95007
    
 times:
 numba:  0.012002706527709961
 python:  0.13403034210205078

Code

idx_split: https://drive.google.com/file/d/1hSduTs1_s3seEFAiyk_n5yk36ZBl0AXW/view?usp=sharing

dist_min_orto: https://drive.google.com/file/d/1fwarVmBa0NGbWPifBEezTzjEZSrHncSN/view?usp=sharing

import time
import numba
import numpy as np
from numba.pycc import CC

cc = CC('compile_func')
cc.verbose = True  
   
@numba.njit(parallel=True, fastmath=True)
@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in numba.prange(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in numba.prange(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat


if __name__ == "__main__":
    cc.compile()

    
# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat
    
import compile_func  
#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(float)
idx = np.load('idx_split.npy').astype(int)

# DUMMY DATA
arr = np.random.randint(50, size=1456391).astype(float)
idx = np.cumsum(np.random.randint(5, size=95007).astype(int))
print(len(arr), len(idx))



#NUMBA FUNC
t0 = time.time()
print(compile_func.nb_array2mat(arr, idx))
print(time.time() - t0)

# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)

Answer 1

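You cannot use nb.prange on the first loop, because out is shared between threads and is both read and written by them. This causes a race condition. Numba assumes there are no dependencies between iterations, and it is your responsibility to guarantee that. The simplest solution is not to use a parallel loop here.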

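Furthermore, the second loop is mostly memory-bound, so I do not expect multithreading to speed it up much: RAM is a shared resource with limited throughput (a few threads are usually enough to saturate it, especially on PCs, where a single thread is sometimes enough).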

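Fortunately, you do not need to create the temporary out list: the end offsets are enough to compute len_col in the parallel loop. The maximum cols can be computed on the fly in the first loop. The first loop should run very quickly compared to the second one. Filling a newly allocated matrix is often faster in parallel on Linux, because page faults can be handled in parallel. AFAIK, this does not hold on Windows (most likely because page faults scale worse there). This is also better here because the range 0:len_col is variable, so the time to fill that part of the matrix is variable, causing some threads to finish after others (the slowest thread bounds the execution time). Moreover, this is generally much faster on NUMA machines, since each NUMA node can write to its own memory.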

Note that AOT compilation does not support automatic parallel execution. Quoting the Numba developers:

From discussion in today's triage meeting, related to #7696: this is not likely to be supported as AOT code doesn't require Numba to be installed - this would mean a great deal of work and issues to overcome for packaging the code for the threading layers.

The same applies to fastmath, although support for it may be added in an upcoming release given the current work on it.

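Note that JIT compilation and AOT compilation are two separate processes. The parameters of njit are therefore not shared with cc.export, and the signature is not shared with njit. This means the njit function would be compiled at its first execution, due to lazy compilation. That being said, the function is redefined afterwards, so the njit is simply useless here (it is overwritten).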
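Because a lazily compiled function pays its compilation cost on the first call, a benchmark that times only one call can end up measuring the compiler rather than the code. The sketch below shows one way to keep that cost out of the measurement; time_after_warmup is a hypothetical helper written for this illustration and is not part of Numba.

```python
import time

def time_after_warmup(fn, *args, repeat=3):
    """Return the best wall-clock time of fn(*args), excluding the first call."""
    fn(*args)  # warm-up call: a lazily compiled function gets compiled here
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Usage with any callable; a jitted function would be passed the same way.
elapsed = time_after_warmup(sorted, list(range(1000, 0, -1)))
print(f"best of 3: {elapsed:.6f} s")
```

With eager compilation (an explicit signature passed to njit) the compilation happens at decoration time instead, so the warm-up call is then only a precaution.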

Here is the resulting code (using only a JIT implementation with eager compilation rather than the AOT one):

import time
import numba
import numpy as np

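# parallel=True is required for numba.prange below to actually run in parallel;
# without it, prange behaves like a plain range
@numba.njit('f8[:,:](f8[:], i4[:])', parallel=True, fastmath=True)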
def nb_array2mat(arr, idx):
    # split arr by idx indexes
    s = 0
    ends = np.empty(len(idx), dtype=np.int_)
    cols = 0
    for n in range(len(idx)):
        e = idx[n]
        ends[n] = e
        len_col = e - s
        cols = max(cols, len_col)
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    rows = len(idx)
    mat = np.empty(shape=(rows, cols))
    for row in numba.prange(rows):
        s = ends[row-1] if row >= 1 else 0
        e = ends[row]
        len_col = e - s
        mat[row, 0:len_col] = arr[s:e]
        mat[row, len_col:cols] = 1000000.0
    return mat

# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat

#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(np.float64)
idx = np.load('idx_split.npy').astype(np.int32)

#NUMBA FUNC
t0 = time.time()
print(nb_array2mat(arr, idx))
print(time.time() - t0)

# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)

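On my machine, the new Numba code is slightly faster: the Numba implementation takes 0.358 seconds, versus 0.418 seconds for the Python implementation. In fact, the sequential Numba code is even slightly faster on my machine, taking 0.344 seconds.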

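Note that the output matrix has shape (95007, 5469), so it takes 3.87 GiB of memory. You should check that you have enough memory to store it. In fact, the Python implementation takes about 7.5 GiB on my machine (possibly because the GC/default allocator does not release memory right away). If you do not have enough memory, the system may fall back on swap, which is very slow because it uses your storage device.

In addition, x86-64 processors use a write-allocate cache policy, which causes written cache lines to actually be read first by default. Non-temporal writes can be used to avoid this on large matrices. Unfortunately, neither Numpy nor Numba uses them on my machine, which means half of the RAM throughput is wasted. On top of that, page faults are quite expensive: in the sequential case, 60% of the time of the Numpy implementation is spent in page faults. The Numba code spends almost all of its time writing to memory and handling page faults. There is a related open issue.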
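The memory figure quoted above can be checked with a few lines of arithmetic. This assumes 8 bytes per element, which matches the f8 (float64) return type used in the signatures.

```python
rows, cols = 95007, 5469   # output shape reported above
itemsize = 8               # bytes per float64 element (assumed from the f8 signature)

size_bytes = rows * cols * itemsize
size_gib = size_bytes / 2**30

print(f"{size_bytes} bytes = {size_gib:.2f} GiB")
```

The same calculation can be run on rows and cols before allocating, to decide whether the padded matrix fits in RAM at all.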

Answer 2

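Based on @Jérôme Richard's answer, I wrote the same function. The improvement lies in how the mat numpy array is created: as stated in the previous answer, np.full takes longer to run for an array of this memory size, so the solution is to initialize it with np.empty.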

The improvement of numba over python is not large, but the size of the mat array has a big impact on the processing time.

1456391 95007
python:  0.29506611824035645
numba:  0.1800403594970703

Code

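import numpy as np
from numba.pycc import CC

cc = CC('compile_func')  # same AOT module name as in the question

@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')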
def nb_array2mat(arr, idx):
    s = 0
    _len = np.empty(len(idx), dtype=np.int_)
    _len[0] = idx[0]
    _len[1:] = idx[1:] - idx[:-1]

    # create a 2d array
    cols = int(np.max(_len))
    rows = len(idx)
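    # np.empty leaves memory uninitialized: cells past len_col in each row keep
    # arbitrary values rather than the fill value used in the question.
    # np.float64 replaces np.float_, which was removed in NumPy 2.0.
    mat = np.empty(shape=(rows, cols), dtype=np.float64)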

    for row in range(len(idx)):
        e = idx[row]
        len_col = _len[row]
        mat[row, :len_col] = arr[s:e]
        s = e
    return mat
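As for whether plain numpy can do this: the row-length computation above is equivalent to np.diff with prepend=0, and the padding can be done with a boolean mask instead of a Python loop. The following is a minimal sketch of that idea, assuming idx holds non-decreasing end offsets with idx[-1] <= len(arr); pad_ragged is a name chosen for this illustration.

```python
import numpy as np

def pad_ragged(arr, idx, fill_value=1000000.0):
    """Split arr at the end offsets in idx and pad the rows into a 2D array."""
    idx = np.asarray(idx)
    lengths = np.diff(idx, prepend=0)        # row lengths from end offsets
    rows, cols = len(idx), int(lengths.max())
    mat = np.full((rows, cols), fill_value, dtype=np.float64)
    # mask[r, c] is True where column c holds real data in row r
    mask = np.arange(cols) < lengths[:, None]
    # Boolean assignment fills True cells in row-major order, matching arr order
    mat[mask] = arr[:idx[-1]]
    return mat

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
idx = np.array([2, 5, 6])
padded = pad_ragged(arr, idx, fill_value=-1.0)
print(padded)
```

The mask is a boolean array with the same shape as the output, so this version needs one extra byte per cell on top of the matrix itself. For a matrix as large as the one discussed here, that overhead may matter.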
