许多小矩阵加速循环

Question

我有一个大坐标网格（向量 a 和 b），为此我生成并求解矩阵 (10x10) 方程。有没有办法让 scipy.linalg.solve 接受矢量输入？到目前为止，我的解决方案是运行 for 在坐标数组上循环。

import time
import numpy as np
import scipy.linalg

N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N))      # input matrix, not static

def f(a,b,n):     # array-filling function
   return a*b*n

def sol(x,y):   # matrix solver
   D = np.arange(0,N)
   B = f(x,y,D)**2 + f(x-1, y+1, D)      # source vector
   X = scipy.linalg.solve(A,B)
   return X    # output an N-size vector

start = time.time()

answer = np.zeros(shape=(a.size, b.size)) # predefine output array

for egg in range(a.size):             # an ugly double-for cycle
   for ham in range(b.size):
       aa = a[egg]
       bb = b[ham]
       answer[egg,ham] = sol(aa,bb)[0]

print time.time() - start

Answer 1

@yevgeniy 你是对的，有效地解决多个独立的线性系统 A x = b 与 scipy 有点棘手（假设 A 每次迭代都会改变的数组）。

例如，这是求解 1000 个形式为 A x = b 的系统的基准，其中 A 是 10x10 矩阵，b一个10元素向量。令人惊讶的是，将所有这些放入一个块对角矩阵并调用 scipy.linalg.solve 一次的方法对于密集矩阵和稀疏矩阵确实都比较慢。

import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve

N = 10
M = 1000 # number of coordinates 
Ai = np.random.randn(N, N) # we can compute the inverse here,
# but let's assume that Ai are different matrices in the for loop loop
bi = np.random.randn(N)

%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop

Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)

%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)

# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop

Afull = block_diag(*[Ai]*M) 

%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)

# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop

具有稀疏数组的线性系统的解决方案更快，但创建此块对角线数组的时间实际上非常慢。至于密集数组，在这种情况下它们只是更慢（并且占用大量 RAM）。

也许我遗漏了一些有关如何使稀疏数组有效工作的信息，但如果您保留 for 循环，则可以做两件事来进行优化。

来自纯 python，查看 scipy.linalg.solve 的源代码：删除不必要的测试并分解循环外的所有重复操作。例如，假设你的数组不是对称正数，我们可以做

from scipy.linalg import get_lapack_funcs

gesv, = get_lapack_funcs(('gesv',), (Ai, bi))

def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and B is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)

%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]

# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop

这导致 6.5 倍的加速。

如果您需要更好的性能，则必须在 Cython 中移植此 for 循环并直接在 C 中连接 gesv BLAS 函数，如所讨论的 here，或者更好地使用 Cython API 对于 Scipy 0.16 中的 BLAS/LAPACK。

编辑：正如@Eelco Hoogendoorn 提到的，如果您的 A 矩阵是固定的，则有一种更简单、更有效的方法。

Answer 2

为了说明我关于广义 ufunc 的观点，以及消除 egg 和 ham 循环的能力，请考虑以下代码：

import numpy as np
A = np.random.randn(4,4,10,10)
AI = np.linalg.inv(A)
#check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :])

许多小矩阵加速循环

Many small matrices speed-up for loops

python

for-loop

numpy

blas

scipy