Why are np.hypot and np.subtract.outer so fast compared to vanilla broadcasting? Using Numba to speed up NumPy in parallel for distance matrix calculation

Question

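I have two large sets of 2D points and I need to calculate a distance matrix.

I need it to be fast in Python, so obviously I used NumPy. I recently learned about NumPy broadcasting and used it, rather than looping in Python; NumPy does the looping in C.

I really thought broadcasting was all I needed, until I saw other methods working better than vanilla broadcasting. I have two ways of calculating the distance matrix, and I don't understand why one is better than the other.

I looked at https://github.com/numpy/numpy/issues/14761 and got contradictory results.

Below are the two ways of calculating the distance matrix.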

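Cells [3, 4, 5] and [8, 9] both calculate the distance matrix, but 3+4, which use subtract.outer, are far faster than 8, which uses vanilla broadcasting, and 5, which uses hypot, is far faster than 9, which is the naive way. I did not try Python loops, assuming they would never finish.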

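I want to know:

1. Is there a faster way to calculate the distance matrix (maybe scikit-learn or scipy)?

2. Why are hypot and subtract.outer so fast?

For convenience I have also attached a snippet to run the whole thing; I change the seed to prevent any cached results from being reused.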

### Cell 1
import numpy as np

np.random.seed(858442)

### Cell 2
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 2.02 ms, sys: 1.4 ms, total: 3.42 ms
Wall time: 1.84 ms

### Cell 3
%%time
d0 = np.subtract.outer(obs[:,0], interp[:,0])

CPU times: user 2.46 s, sys: 1.97 s, total: 4.42 s
Wall time: 4.42 s

### Cell 4
%%time
d1 = np.subtract.outer(obs[:,1], interp[:,1])

CPU times: user 3.1 s, sys: 2.7 s, total: 5.8 s
Wall time: 8.34 s

### Cell 5
%%time
h = np.hypot(d0, d1)

CPU times: user 12.7 s, sys: 24.6 s, total: 37.3 s
Wall time: 1min 6s

### Cell 6
np.random.seed(773228)

### Cell 7
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 1.84 ms, sys: 1.56 ms, total: 3.4 ms
Wall time: 2.03 ms

### Cell 8
%%time
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]

CPU times: user 22.7 s, sys: 8.24 s, total: 30.9 s
Wall time: 33.2 s

### Cell 9
%%time
h = np.sqrt(d0**2 + d1**2)

CPU times: user 29.1 s, sys: 2min 12s, total: 2min 41s
Wall time: 6min 10s

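Update, thanks to Jérôme Richard

Stack Overflow never disappoints.
Using Numba
It has a just-in-time compiler that converts a Python snippet into fast machine code. The first use is a bit slower than subsequent uses because it has to compile. But even on the first call, njit with parallel=True beats hypot + subtract.outer on a (49000, 12000) matrix.

Performance of the various methods

Make sure to use a different seed each time you run the script.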

import sys
import time

import numba as nb
import numpy as np

np.random.seed(int(sys.argv[1]))

d0 = np.random.random((49000, 2))
d1 = np.random.random((12000, 2))

# Sequential Numba version: eagerly compiled, no parallel=True, plain range
@nb.njit((nb.float64[:, :], nb.float64[:, :]))
def f1(d0, d1):
    print('Numba without parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in range(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

# Add eager compilation, compiles before hand
@nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
    print('Numba with parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

def f3(d0, d1):
    print('hypot + subtract.outer')
    # Return the result so all three functions have the same contract
    return np.hypot(
        np.subtract.outer(d0[:, 0], d1[:, 0]),
        np.subtract.outer(d0[:, 1], d1[:, 1])
    )

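if __name__ == '__main__':
    # Look the function up by name instead of calling eval on a command-line argument
    func = {'f1': f1, 'f2': f2, 'f3': f3}[sys.argv[2]]
    s1 = time.perf_counter()
    func(d0, d1)
    print(time.perf_counter() - s1)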

(base) ~/xx@xx:~/xx$ python3 test.py 523432 f3
hypot + subtract.outer
9.79756784439087
(base) xx@xx:~/xx$ python3 test.py 213622 f2
Numba with parallel
0.3393140316009521

I will update this post as things develop, if I find a faster way.

Answer 1

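First of all, d0 and d1 each take 50000 x 30000 x 8 = 12 GB, which is pretty big. Make sure you have more than 100 GB of memory, because that is what the whole script requires! That is a huge amount of memory. If you do not have enough memory, the operating system will use a storage device (e.g. swap) to store the excess data, which is much slower. Actually, there is no reason for Cell 4 to be slower than Cell 3; my guess is that you already did not have enough memory to (fully) store d1 in RAM, while d0 seemed to fit (mostly) in memory. On my machine there is no difference when both fit in RAM (you can also reverse the order of the operations to check this). This also explains why the later operations tend to get slower.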
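The 12 GB figure can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `dense_matrix_bytes` is made up for illustration, and it counts the raw array payload only, ignoring any allocator or interpreter overhead.

```python
import numpy as np

def dense_matrix_bytes(n_rows, n_cols, dtype=np.float64):
    """Bytes needed for one dense (n_rows, n_cols) array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize

n_obs, n_interp = 50_000, 30_000  # sizes used in the question
one_array = dense_matrix_bytes(n_obs, n_interp)
print(f"one float64 array: {one_array / 1e9:.1f} GB")

# The broadcast difference in Cell 8 has shape (n_obs, n_interp, 2),
# so that single array is twice as large as d0 or d1 alone.
print(f"broadcast difference array: {2 * one_array / 1e9:.1f} GB")
```

Running the same calculation for any other intermediate array before launching the notebook is a cheap way to see whether the working set will fit in RAM.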

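That said, Cells 8+9 are also slower because they create temporary arrays and need more memory passes to compute the result than Cells 3+4+5. Indeed, the expression np.sqrt(d0**2 + d1**2) first computes d0**2 in memory, producing a new 12 GB temporary array, then computes d1**2, producing another 12 GB temporary array, then sums the two temporary arrays, producing yet another 12 GB temporary array, and finally computes the square root, producing another 12 GB array. This can require up to 48 GB of memory and takes 4 memory-bound read/write passes. This is not efficient and does not use the CPU/RAM effectively (e.g. the CPU cache).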
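One way to see the cost of the temporaries is to write the same expression with the `out=` argument that NumPy ufuncs accept, so that each step reuses an existing buffer. The sketch below uses deliberately small arrays (500 and 300 points, chosen only for illustration) and overwrites d0 and d1, which is only acceptable if the raw differences are not needed afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.random((500, 2))      # small stand-ins for the 50000 / 30000 point sets
interp = rng.random((300, 2))

d0 = np.subtract.outer(obs[:, 0], interp[:, 0])
d1 = np.subtract.outer(obs[:, 1], interp[:, 1])

# Reference: allocates new arrays for d0**2, d1**2, their sum, and the result
expected = np.sqrt(d0**2 + d1**2)

# In-place variant: every step writes into a buffer that already exists
np.square(d0, out=d0)
np.square(d1, out=d1)
np.add(d0, d1, out=d0)
np.sqrt(d0, out=d0)             # d0 now holds the distances

same = np.allclose(d0, expected)
print(same)
```

This removes the extra allocations but still makes four passes over memory. `np.hypot(d0, d1, out=d0)` does the same work in a single pass, which is consistent with Cell 5 being faster than Cell 9.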

There is a faster implementation, which consists in doing the whole computation in one pass using Numba's JIT. Here is an example:

import numba as nb
import numpy as np
@nb.njit(parallel=True)
def distanceMatrix(a, b):
    res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            res[i, j] = np.sqrt((a[i, 0] - b[j, 0])**2 + (a[i, 1] - b[j, 1])**2)
    return res

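This implementation uses 3 times less memory (only 12 GB) and is much faster than the one using subtract.outer. Indeed, because of swapping, Cells 3+4+5 take a few minutes, while this one takes 1.3 seconds!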

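The key point is that memory accesses are expensive, and so are temporary arrays. One needs to avoid making multiple passes over memory when working on huge buffers, and to take advantage of CPU caches when the computation performed is not trivial (for example by using array chunks).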
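As a rough illustration of the "array chunks" idea in plain NumPy, the sketch below computes the distances one block of rows at a time and reduces each block immediately, so the full matrix never exists in memory. The function name `distance_matrix_chunked`, the `chunk_rows` parameter, the array sizes, and the nearest-neighbour reduction are all assumptions made for this example; a good chunk size depends on the machine's caches and would need to be measured.

```python
import numpy as np

def distance_matrix_chunked(a, b, chunk_rows=1024):
    """Yield (start, block), where block holds the distances from
    a[start:start + chunk_rows] to every point of b."""
    bx, by = b[:, 0], b[:, 1]
    for start in range(0, a.shape[0], chunk_rows):
        block_a = a[start:start + chunk_rows]
        dx = np.subtract.outer(block_a[:, 0], bx)
        dy = np.subtract.outer(block_a[:, 1], by)
        np.hypot(dx, dy, out=dx)   # single pass, reuses the dx buffer
        yield start, dx

rng = np.random.default_rng(0)
a = rng.random((2500, 2))
b = rng.random((700, 2))

# Example reduction: index of the nearest point of b for each point of a
nearest = np.empty(a.shape[0], dtype=np.intp)
for start, block in distance_matrix_chunked(a, b, chunk_rows=1024):
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)
```

This only helps when the consumer can work block by block (nearest neighbour, thresholding, weighted sums for interpolation). If the whole matrix really is needed, the single-pass Numba function above remains the more direct option.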
