Performance of double np.einsum and how to speed up
Consider this MWE:
import numpy as np
a = np.random.uniform(0,1,size=[14,25,25])
b = np.random.uniform(0,1,size=[14,25,25])
c = np.random.uniform(0,1,size=[14,25])
def my_func(a,b,c):
    InnerSum = np.einsum('lpk, lkm -> lpm', a, b)
    OuterSum = np.einsum('lp, lpm -> lm', c, InnerSum)
    Result = 2 * OuterSum
    return Result
my_func() was my first attempt at the calculation, but I would like to make it faster. I then tried the following modified function:
def my_func_2(a,b,c):
    OuterSum = np.einsum('lpk, lkm, lp -> lm', a, b, c)
    Result = 2 * OuterSum
    return Result
However, when I run %timeit on both functions, I get:
%timeit my_func(a,b,c)
293 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_func_2(a,b,c)
347 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Why is the second approach slower than the first? How can I optimize my_func() to make it faster?
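Given that the loop count (the length of the first axis of a and b) is not a large number compared with the lengths along the other axes, we can run a simple loop and leverage BLAS-supported matrix multiplication at each iteration. Those lengths also mean each iteration performs enough sum-reductions to justify a for-loop in this case.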
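A rough operation count helps make the reasoning above concrete. The sketch below counts scalar multiplications only, for the question's shapes, assuming that einsum without optimize evaluates a three-operand expression in a single pass over all four indices; it ignores memory traffic, BLAS efficiency and Python overhead, so it is not a timing model.

```python
# Rough multiplication counts for the shapes in the question (L=14, P=K=M=25).
# Operation counts only, not timings.
L, P, K, M = 14, 25, 25, 25

# my_func, first einsum 'lpk,lkm->lpm': one multiply per (l, p, k, m)
inner = L * P * K * M
# my_func, second einsum 'lp,lpm->lm': one multiply per (l, p, m)
outer = L * P * M
two_step = inner + outer

# my_func_2 without optimize: one pass over every (l, p, k, m),
# forming a * b * c, i.e. two multiplies per element
one_pass = 2 * L * P * K * M

# Loop approach: c[i].dot(a[i]) is a (P,) x (P, K) vector-matrix product,
# then the (K,) result is multiplied by b[i] of shape (K, M)
loop = L * (P * K + K * M)

print(two_step, one_pass, loop)  # 227500 437500 17500
```

Contracting c with a first shrinks each slice to a vector before the second product, so the loop does roughly an order of magnitude less arithmetic, while the single-pass three-operand einsum does about twice the work of the two-step version, which is consistent with my_func_2 being slower than my_func.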
The implementation would be -
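N,M = b.shape[::2]   # N = length of the first axis, M = last axis of b
out = np.empty((N,M))
for i in range(N):
    # (P,) vector times (P,K) matrix, then times (K,M) matrix: two BLAS calls per slice
    out[i] = c[i].dot(a[i]).dot(b[i])
out *= 2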
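A quick self-contained check that the loop gives the same result as the original two-step einsum can be sketched as follows. The names loop_dot and batched_matmul are hypothetical labels for this sketch; batched_matmul is a swapped-in alternative not proposed in the answer, which applies the same "c times a, then times b" ordering through np.matmul broadcasting over the first axis instead of a Python loop.

```python
import numpy as np

def my_func(a, b, c):
    # Reference: the two-step einsum from the question
    inner = np.einsum('lpk,lkm->lpm', a, b)
    return 2 * np.einsum('lp,lpm->lm', c, inner)

def loop_dot(a, b, c):
    # Per-slice BLAS products, as in the answer
    N, M = b.shape[::2]
    out = np.empty((N, M))
    for i in range(N):
        out[i] = c[i].dot(a[i]).dot(b[i])
    out *= 2
    return out

def batched_matmul(a, b, c):
    # Swapped-in alternative (not from the answer): view c as a stack of
    # (1, P) row vectors so np.matmul broadcasts over the first axis,
    # then drop the singleton row axis
    return 2 * (c[:, None, :] @ a @ b)[:, 0, :]

rng = np.random.default_rng(0)
a = rng.uniform(0, 1, size=(14, 25, 25))
b = rng.uniform(0, 1, size=(14, 25, 25))
c = rng.uniform(0, 1, size=(14, 25))

ref = my_func(a, b, c)
print(np.allclose(loop_dot(a, b, c), ref))        # True
print(np.allclose(batched_matmul(a, b, c), ref))  # True
```

Whether the batched version beats the explicit loop at these sizes depends on the NumPy and BLAS build, so it is worth timing both rather than assuming.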
Benchmarking
Using the optimize argument, which seems to improve the performance of my_func_2 considerably, and also adding the proposed function -
def my_func(a,b,c, optimize=False):
    InnerSum = np.einsum('lpk, lkm -> lpm', a, b, optimize=optimize)
    OuterSum = np.einsum('lp, lpm -> lm', c, InnerSum, optimize=optimize)
    Result = 2 * OuterSum
    return Result
def my_func_2(a,b,c, optimize=False):
    OuterSum = np.einsum('lpk, lkm, lp -> lm', a, b, c, optimize=optimize)
    Result = 2 * OuterSum
    return Result
def my_func_3(a,b,c):
    N,M = b.shape[::2]
    out = np.empty((N,M))
    for i in range(N):
        out[i] = c[i].dot(a[i]).dot(b[i])
    out *= 2
    return out
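The timing comparison uses IPython's %timeit magic. In a plain Python script, a rough stand-in can be built with the standard-library timeit module; the repeat and number values here are arbitrary choices, and absolute numbers will differ by machine and NumPy/BLAS build.

```python
import timeit
import numpy as np

def my_func_3(a, b, c):
    # Loop-with-BLAS version from the answer
    N, M = b.shape[::2]
    out = np.empty((N, M))
    for i in range(N):
        out[i] = c[i].dot(a[i]).dot(b[i])
    out *= 2
    return out

np.random.seed(0)
a = np.random.uniform(0, 1, size=[14, 25, 25])
b = np.random.uniform(0, 1, size=[14, 25, 25])
c = np.random.uniform(0, 1, size=[14, 25])

# Best of 7 repeats of 1000 calls each, converted to microseconds per call
times = timeit.repeat(lambda: my_func_3(a, b, c), repeat=7, number=1000)
best_us = min(times) / 1000 * 1e6
print(f"my_func_3: {best_us:.1f} µs per loop (best of 7)")
```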
Timings -
In [51]: # Setup used in the question
...: np.random.seed(0)
...: a = np.random.uniform(0,1,size=[14,25,25])
...: b = np.random.uniform(0,1,size=[14,25,25])
...: c = np.random.uniform(0,1,size=[14,25])
# With einsum optimize set as False
In [52]: %timeit my_func(a,b,c, optimize=False)
...: %timeit my_func_2(a,b,c, optimize=False)
...: %timeit my_func_3(a,b,c)
1000 loops, best of 3: 255 µs per loop
1000 loops, best of 3: 302 µs per loop
10000 loops, best of 3: 28.7 µs per loop
# With einsum optimize set as True
In [53]: %timeit my_func(a,b,c, optimize=True)
...: %timeit my_func_2(a,b,c, optimize=True)
...: %timeit my_func_3(a,b,c)
1000 loops, best of 3: 334 µs per loop
10000 loops, best of 3: 77.6 µs per loop
10000 loops, best of 3: 28.6 µs per loop
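The large gain for my_func_2 with optimize=True presumably comes from einsum splitting the three-operand expression into pairwise contractions instead of one pass over all four indices. np.einsum_path reports which contraction order einsum would pick; a minimal sketch on the question's data, using 'greedy', which is what optimize=True maps to:

```python
import numpy as np

np.random.seed(0)
a = np.random.uniform(0, 1, size=[14, 25, 25])
b = np.random.uniform(0, 1, size=[14, 25, 25])
c = np.random.uniform(0, 1, size=[14, 25])

subscripts = 'lpk,lkm,lp->lm'

# Ask einsum which pairwise contraction order it would use
path, report = np.einsum_path(subscripts, a, b, c, optimize='greedy')
print(path)    # list starting with 'einsum_path', then operand-index tuples per step
print(report)  # readable summary, including naive vs optimized FLOP estimates

# A precomputed path can be passed back to einsum and reused across calls
fast = 2 * np.einsum(subscripts, a, b, c, optimize=path)
slow = 2 * np.einsum(subscripts, a, b, c, optimize=False)
print(np.allclose(fast, slow))  # True
```

Computing the path once and passing it to einsum avoids repeating the path search on every call, which may matter at sizes this small.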