Optimizing array additions and multiplications with transposes

I want to perform additions and multiplications of arrays on top of array transposes.

Given an array A, the computation is sum += p(A) * factor(p)

where p is a permutation/transpose and factor(p) is a list of prefactors. For example, if A is a two-dimensional array, the goal is

0.1*A + 0.1*transpose(A,(1,0))
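
A minimal sketch of what I mean (illustrative only, on a small array; the names perms and factors here are just examples, not my real data):

import numpy as np
import itertools as it

A = np.random.random((3, 3, 3))
perms = list(it.permutations(range(A.ndim)))     # all axis permutations p
factors = [0.1 * i for i in range(len(perms))]   # one prefactor per permutation

total = np.zeros_like(A)
for factor, p in zip(factors, perms):
    total += factor * np.transpose(A, p)         # sum += p(A) * factor(p)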

In my Python code I find that, when higher-dimensional permutations are applied, the array addition takes far more time than the transpose. Perhaps pointers are used in the C implementation of numpy.transpose. I would like to know: is there any way to optimize the timing of the array-addition part? numpy.add does not help much. Should I somehow only sum over the part of the array actually affected by the transpose, and handle the rest with a multiplication? For example, the (0,1) part of the permutation (0,1,3,2) is just the previous array multiplied by a common factor. Or should I use cpython to improve performance?

Here is my Python code:

import numpy as np
import time
import itertools as it

ref_list = [0, 1, 2, 3, 4]
p = it.permutations(ref_list)
transpose_list = tuple(p)
print(type(transpose_list),type(transpose_list[0]),transpose_list[0])


n_loop = 2
na = nb = nc = nd = ne = 20
A = np.random.random((na,nb,nc,nd,ne))
sum_A = np.zeros((na,nb,nc,nd,ne))
factor_list = [i*0.1 for i in range(120)]

time_transpose = 0
time_add = 0
time_multiply = 0

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        start = time.time()
        B = np.transpose(A, transpose_list[m] )  
        finish = time.time()
        time_transpose += finish - start

        start = time.time()
        B_p = B * factor_list[m]  
        finish = time.time()
        time_multiply += finish - start

        start = time.time()
        sum_A += B_p
        finish = time.time()
        time_add += finish - start

print(time_transpose, time_multiply, time_add, time_multiply/time_transpose, time_add/time_transpose)  

The output is

0.004961967468261719 1.1218750476837158 3.7830252647399902 226.09480107630213 762.404285988852

The addition takes about 700 times longer than the transpose.

I tried to use a transpose implemented with Numba, by adding


import numba as nb

@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
    blockSize, tileSize = 256, 32  # To be tuned
    n, m = mat.shape
    assert blockSize % tileSize == 0
    for tmp in nb.prange((m+blockSize-1)//blockSize):
        i = tmp * blockSize
        for j in range(0, n, blockSize):
            tiMin, tiMax = i, min(i+blockSize, m)
            tjMin, tjMax = j, min(j+blockSize, n)
            for ti in range(tiMin, tiMax, tileSize):
                for tj in range(tjMin, tjMax, tileSize):
                    out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T

and using

B = transpose(A, transpose_list[m] )

I get

Traceback (most recent call last):
  File "transpose_test_v2.py", line 46, in <module>
    B = transpose(A, transpose_list[m] )
  File "/home/.../lib/python3.8/site-packages/numba/core/dispatcher.py", line 717, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float64, 6d, C), UniTuple(int64 x 6)

or, using B = nb.transpose(A, transpose_list[m] ), I get

    B = nb.transpose(A, transpose_list[m] )
AttributeError: 'int' object has no attribute 'transpose'

np.transpose does not physically transpose the array in memory. It changes the strides of the array so that the resulting view is (logically) transposed. The problem is that the memory layout of this view is very inefficient after the transposition, so operations on it are not efficient. You can take a copy of the transposed view, or call np.ascontiguousarray on it, to perform the transposition eagerly. Note that the current implementation of NumPy …
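
In the loop from the question, that just means materializing the transposed view before using it, for example (a sketch of the copy variant, not the exact code benchmarked below):

B = np.ascontiguousarray(np.transpose(A, transpose_list[m]))  # eager, contiguous copy
# or equivalently
B = np.transpose(A, transpose_list[m]).copy()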

Here are the results on my machine:

Initial code:
0.0030066967010498047 1.2316536903381348 1.971841812133789 409.6368249940528 655.8166679882642

With a copy of the transposed array:
2.483657121658325 1.221949577331543 0.6028566360473633 0.4919960837893974 0.2427294133277298

Besides that, preallocating the arrays and using np.add and np.multiply with their out argument should help make the multiplication and the addition faster. For even better performance, you can use parallel Numba code with explicit loops. Optimizing the transposition itself is very hard in your case because of the high number of dimensions.
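
For example (a sketch assuming B_p is allocated once, outside the loops, with the same shape as A):

B_p = np.empty_like(A)                    # preallocated once and reused

# inside the loop over permutations:
np.multiply(B, factor_list[m], out=B_p)   # B_p = B * factor, no new allocation
np.add(sum_A, B_p, out=sum_A)             # sum_A += B_p, in place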


Update:

Here is a faster Numba implementation that mostly removes the cost of the add/multiply operations and reduces the cost of the transposition by using multiple threads:

# Same setup as in the question
import numba

@numba.njit('(float64[:,:,:,:,::1], float64[:,:,:,:,::1], float64, int_, int_, int_, int_, int_)', parallel=True, fastmath=True)
def computeRound(A, sum_A, factor, i0, i1, i2, i3, i4):
    size = 20
    B = np.transpose(A, (i0, i1, i2, i3, i4))
    for j0 in numba.prange(size):
        for j1 in range(size):
            for j2 in range(size):
                for j3 in range(size):
                    for j4 in range(size):
                        sum_A[j0, j1, j2, j3, j4] += B[j0, j1, j2, j3, j4] * factor

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        i0, i1, i2, i3, i4 = transpose_list[m]
        computeRound(A, sum_A, factor_list[m], i0, i1, i2, i3, i4)

This code is 4 times faster than the initial code. It is almost entirely limited by the bad access pattern of the transposition, and the transposition itself is hard to optimize. One solution is to apply the transpositions in an order that maximizes the cache locality left by the previous transposition (fast, but particularly hard to implement). A simpler (but slower) solution is to use the previously provided optimized Numba transposition code and to handle the (i0, i1, i2, i4, i3) cases manually instead of the (i0, i1, i2, i3, i4) cases.
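
A rough sketch of that simpler idea (assuming B is a contiguous copy made for the permutation (i0, i1, i2, i3, i4), and factor_swapped is a hypothetical name for the prefactor of the sibling permutation (i0, i1, i2, i4, i3)):

sum_A += B * factor                             # (i0, i1, i2, i3, i4) case
C = np.ascontiguousarray(np.swapaxes(B, 3, 4))  # sibling case: only the last two axes
                                                # are swapped, and each 20x20 tile fits in cache
sum_A += C * factor_swapped                     # (i0, i1, i2, i4, i3) case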

Your error message looks like it comes from the "(m+blockSize-1)//blockSize" part, where the division gives you a double-precision floating-point number. Instead of dividing by blockSize (256), you could right-shift by 8 and stay with integers.
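
In other words, something like this (a small sketch of that suggestion, valid only because blockSize is 256):

n_blocks = (m + blockSize - 1) >> 8   # same value as (m + blockSize - 1) // 256, kept as an integer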