How does architecture affect numpy array operation performance?

Question

I have Ubuntu 14.04 with the "Anaconda" Python distribution, which ships with Intel's Math Kernel Library (MKL). My processor is an Intel Xeon with 8 cores and no Hyper-Threading (so only 8 threads).

For me, numpy tensordot consistently outperforms einsum for large arrays. However, others have found very little difference between the two, or even that einsum may outperform tensordot for some operations.

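For people whose numpy distribution is built against a fast library, I would like to know why this might happen. Does MKL run more slowly on non-Intel processors? Or does einsum run faster on more modern Intel processors with better threading capabilities?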

Here is a quick example comparing performance on my machine:

In  [27]: a = rand(100,1000,2000)

In  [28]: b = rand(50,1000,2000)

In  [29]: time cten = tensordot(a, b, axes=[(1,2),(1,2)])
CPU times: user 7.85 s, sys: 29.4 ms, total: 7.88 s
Wall time: 1.08 s

In  [30]: "FLOPS TENSORDOT: {}.".format(cten.size * 1000 * 2000 / 1.08)
Out [30]: 'FLOPS TENSORDOT: 9259259259.26.'

In  [31]: time cein = einsum('ijk,ljk->il', a, b)
CPU times: user 42.3 s, sys: 7.58 ms, total: 42.3 s
Wall time: 42.4 s

In  [32]: "FLOPS EINSUM: {}.".format(cein.size * 1000 * 2000 / 42.4)
Out [32]: 'FLOPS EINSUM: 235849056.604.'

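Tensor operations run with tensordot are consistently in the 5-20 GFLOPS range. With einsum I only get about 0.2 GFLOPS. (Both figures count each multiply-add as a single operation, as in the snippets above.)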

Answer 1

Essentially, you are comparing two very different things:

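np.einsum computes the tensor product with for loops in C. It has some SIMD optimizations, but it is not multithreaded and does not use MKL.
np.tensordot reshapes/broadcasts the input arrays and then calls BLAS (MKL, OpenBLAS, etc.) for the matrix multiplication. The reshaping/broadcasting phase adds some overhead, but the matrix multiplication itself is extremely well optimized with SIMD, some assembly, and multithreading.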

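Consequently, tensordot will generally be faster than einsum even in single-core execution, unless the arrays are small (in which case the reshaping/broadcasting overhead becomes non-negligible). The gap widens further when several cores are available, since the former approach is multithreaded while the latter is not.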
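To make the difference concrete, here is a minimal sketch of what tensordot boils down to for the contraction in the question: flatten the contracted axes and hand a plain 2-D matrix product to BLAS. The array shapes are shrunk and the variable names are my own, chosen only for illustration; this shows the idea and is not a copy of the actual numpy implementation.

```python
import numpy as np

# Small illustrative shapes (assumption: much smaller than in the question
# so that the check runs instantly).
rng = np.random.default_rng(0)
a = rng.random((10, 30, 40))
b = rng.random((5, 30, 40))

# einsum: explicit contraction over the shared axes j and k.
c_einsum = np.einsum('ijk,ljk->il', a, b)

# tensordot: the same contraction, expressed through axis pairs.
c_tensordot = np.tensordot(a, b, axes=[(1, 2), (1, 2)])

# The idea behind tensordot: collapse the contracted axes into one,
# then do an ordinary matrix multiplication (which goes to BLAS).
a2 = a.reshape(a.shape[0], -1)   # shape (10, 1200)
b2 = b.reshape(b.shape[0], -1)   # shape (5, 1200)
c_matmul = a2 @ b2.T             # shape (10, 5)

# All three routes give the same result up to floating-point rounding.
assert np.allclose(c_einsum, c_tensordot)
assert np.allclose(c_tensordot, c_matmul)
```

The reshape step is where the extra overhead mentioned above comes from; for contiguous inputs like these it is cheap, but for small arrays it can still outweigh the gain from BLAS.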

In conclusion, the results you get are perfectly normal and probably hold in general (Intel or non-Intel CPU, modern or not, multi-core or not, using MKL or OpenBLAS, etc.).
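If you want to check this on your own machine, something along these lines should do. It is only a rough sketch: the shapes are scaled down from the question, best_time is a helper name I made up, and the optimize=True variant assumes a reasonably recent numpy (the optimize argument was added in 1.12, as far as I know), where einsum may route suitable contractions through BLAS-backed routines instead of its own C loops.

```python
import timeit
import numpy as np

# Scaled-down shapes (assumption: small enough to run in well under a second).
rng = np.random.default_rng(0)
a = rng.random((20, 100, 200))
b = rng.random((10, 100, 200))

def best_time(fn, repeat=3):
    """Hypothetical helper: best wall-clock time of a single call, in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))

t_tensordot = best_time(lambda: np.tensordot(a, b, axes=[(1, 2), (1, 2)]))
t_einsum = best_time(lambda: np.einsum('ijk,ljk->il', a, b))
# Assumes numpy >= 1.12, where einsum accepts the optimize argument.
t_einsum_opt = best_time(lambda: np.einsum('ijk,ljk->il', a, b, optimize=True))

# Multiply-adds in the contraction, counted the same way as in the question:
# output size times the length of the contracted axes.
mult_adds = a.shape[0] * b.shape[0] * a.shape[1] * a.shape[2]

for name, t in [("tensordot", t_tensordot),
                ("einsum", t_einsum),
                ("einsum(optimize=True)", t_einsum_opt)]:
    print("{:>22}: {:.4f} s, {:.2f} G multiply-adds/s".format(
        name, t, mult_adds / t / 1e9))
```

The numbers will vary with the BLAS library and the thread count; np.show_config() reports which BLAS your numpy was built against.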
