NumPy ufuncs are 2x faster in one axis over the other

Question

I was doing some computations and measured the performance of ufuncs such as np.cumsum over different axes, in order to make the code more performant.

In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

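cumsum over axis 1 is almost 2x faster than cumsum over axis 0. Why is that, and what is going on behind the scenes? It would be nice to have a clear understanding of the reason behind it. Thanks!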

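Update: After a bit of research, I realized that if someone is building an application in which they always sum over one particular axis, then the array should be initialized in the appropriate order, i.e. C-order for axis=1 sums or Fortran-order for axis=0 sums, to save CPU time.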

Also: this excellent answer on the difference between contiguous and non-contiguous arrays helped a lot!

Answer 1

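The array is row-major. So when you sum over axis 1, the numbers being summed sit next to each other in a contiguous block of memory. That allows better cache performance and therefore faster memory access (see "Locality of reference"). I assume that is the effect you are seeing here.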
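
One way to see this contiguity is through the array's strides, i.e. the number of bytes to step in memory to reach the next element along each axis. A minimal sketch, assuming an explicit 8-byte integer dtype (the default integer size of np.arange depends on the platform):

```python
import numpy as np

# Explicit dtype so the byte counts below do not depend on the platform.
arr = np.arange(1_000_000, dtype=np.int64).reshape(1000, -1)

# strides = bytes to step to reach the next element along (axis 0, axis 1)
print(arr.strides)              # (8000, 8)
print(arr.flags.c_contiguous)   # True
print(arr.flags.f_contiguous)   # False
```

Moving along axis 1 steps 8 bytes at a time, while moving along axis 0 jumps 8000 bytes per element.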

Answer 2

You have a square array. It looks like this:

1 2 3
4 5 6
7 8 9

But computer memory is addressed linearly, so to the computer it looks like this:

1 2 3 4 5 6 7 8 9

Or, if you think about it, it might look like this:

1 4 7 2 5 8 3 6 9

If you are trying to sum [1 2 3] or [4 5 6] (a row), the first layout is faster. If you are trying to sum [1 4 7] or [2 5 8] (a column), the second layout is faster.
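
The two linear layouts above can be reproduced with ravel, which flattens an array by reading its elements in row-major ('C') or column-major ('F') order. This is only an illustration of the two orderings, not a dump of the array's actual memory:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row_major = m.ravel(order='C')   # rows laid out one after another
col_major = m.ravel(order='F')   # columns laid out one after another

print(row_major)   # [1 2 3 4 5 6 7 8 9]
print(col_major)   # [1 4 7 2 5 8 3 6 9]
```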

This happens because data is loaded from memory one "cache line" at a time, which is typically 64 bytes (8 values, given that NumPy's default dtype is an 8-byte float).
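
To put rough numbers on that, here is a back-of-the-envelope sketch for the 1000 x 1000 array from the question. The 64-byte cache line is an assumption (it varies by hardware), the row is assumed to start on a cache-line boundary, and the model ignores prefetching and however NumPy actually orders its inner loops:

```python
import math
import numpy as np

CACHE_LINE_BYTES = 64  # assumption: typical value, hardware dependent

itemsize = np.dtype(np.float64).itemsize        # 8 bytes per value
values_per_line = CACHE_LINE_BYTES // itemsize  # 8 values fit in one cache line

n = 1000                    # elements along the axis being traversed
row_stride = n * itemsize   # 8000 bytes between vertical neighbours (row-major)

# Walking along a row: neighbouring values share cache lines.
lines_for_row = math.ceil(n * itemsize / CACHE_LINE_BYTES)

# Walking down a column: each step jumps row_stride bytes, more than one
# cache line, so every value lives in a different line.
lines_for_column = n if row_stride >= CACHE_LINE_BYTES else lines_for_row

print(values_per_line, lines_for_row, lines_for_column)   # 8 125 1000
```

Under these assumptions, reading one column of a row-major array touches 8 times as many cache lines as reading one row.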

You can control which layout NumPy uses when constructing an array with the order parameter.
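
For instance, a small sketch of the order parameter on np.zeros and np.array (the shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

c_arr = np.zeros((3, 4), dtype=np.float64)             # default order='C' (row-major)
f_arr = np.zeros((3, 4), dtype=np.float64, order='F')  # column-major

print(c_arr.strides)   # (32, 8)
print(f_arr.strides)   # (8, 24)

# An existing array can be copied into the other layout.
f_copy = np.array(c_arr, order='F')
print(f_copy.flags.f_contiguous)   # True
```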

For more on this, see: https://en.wikipedia.org/wiki/Row-_and_column-major_order

Answer 3

Indeed, the performance depends on the order of the array in memory:

In [36]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [37]: arrf = np.asfortranarray(arr) # change order

In [38]: %timeit arr.cumsum(axis=1)
1.99 ms ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [39]: %timeit arr.cumsum(axis=0)
14.6 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [41]: %timeit arrf.cumsum(axis=0)
1.96 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [42]: %timeit arrf.cumsum(axis=1)
14.6 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For details, see https://docs.scipy.org/doc/numpy-1.13.0/reference/internals.html#multidimensional-array-indexing-order-issues
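
If you want to repeat this comparison outside IPython, something like the following should work. It is a sketch using the standard timeit module; best_ms is a hypothetical helper name, and the absolute numbers will differ from machine to machine:

```python
import timeit
import numpy as np

arr = np.arange(1_000_000).reshape(1000, -1)  # C order (row-major)
arrf = np.asfortranarray(arr)                 # same values, column-major copy

# The layout only changes where values live in memory, not the results.
assert np.array_equal(arr.cumsum(axis=0), arrf.cumsum(axis=0))

def best_ms(a, axis, repeat=5, number=20):
    """Best per-call time in milliseconds over `repeat` runs (hypothetical helper)."""
    times = timeit.repeat(lambda: a.cumsum(axis=axis), repeat=repeat, number=number)
    return min(times) / number * 1e3

timings = {}
for name, a in (("C order", arr), ("F order", arrf)):
    for axis in (0, 1):
        timings[(name, axis)] = best_ms(a, axis)
        print(f"{name}, axis={axis}: {timings[(name, axis)]:.2f} ms")
```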

Tags: python, performance, numpy, numpy-ufunc, numpy-ndarray