为什么优化einsum可以加速二元收缩?
Why optimize in the einsum can accelerate binary contraction?
在https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
optimize{False, True, ‘greedy’, ‘optimal’}, optional
Controls if intermediate optimization should occur. No optimization will occur if False and True will default to the ‘greedy’ algorithm. Also accepts an explicit contraction list from the np.einsum_path function. See np.einsum_path for more details. Defaults to False.
在我看来optimize
标志是选择多个缩略语的顺序。例如,
A B C -> D
对于更快的 (AB)C 或 A(BC) 或 (AC)B,而不是二进制收缩,例如 AB->C
.
下面的代码为A[a,b] * B[b,c,d] = C[a,c,d]
import numpy as np
import time
import scipy.stats
# from
def mean_confidence_interval(data, var_name, unit, confidence=0.95):
a = 1.0 * np.array(data)
n = len(a)
m, se = np.mean(a), scipy.stats.sem(a)
h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
print(var_name, round(m, 5), "\u00B1", round(h, 5), unit )
def einsum_greedy(A, B, na, nb, nc, nd):
#res = np.zeros((na,nc,nd))
res = np.einsum('ab,bcd->acd', A, B, optimize="greedy")
return res
def einsum_standard(A, B, na, nb, nc, nd):
# res = np.zeros((na,nc,nd))
res = np.einsum('ab,bcd->acd', A, B)
return res
def btime_ABC(name_def, name_out, A, B, C, na, nb, nc, nd, n_times):
global opt_path
list_time = []
for i in range(n_times):
start_time = time.time()
C = name_def(A, B, na, nb, nc, nd)
finish_time = time.time()
list_time.append(finish_time - start_time)
mean_confidence_interval(list_time, name_out, 's' )
# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = dim_comm = 90
n_times = 60
print('number of common dimension', dim_comm)
print('number of averaged time', n_times)
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C1 = np.zeros((na,nc,nd))
C2 = np.zeros((na,nc,nd))
btime_ABC(einsum_standard, 'einsum_standard', A, B, C1, na, nb, nc, nd, n_times)
btime_ABC(einsum_greedy, 'einsum_greedy', A, B, C2, na, nb, nc, nd, n_times)
我得到了
number of common dimension 90
number of averaged time 60
einsum_standard 0.04799 ± 0.00312 s
einsum_greedy 0.00805 ± 0.00137 s
optimize
标志有助于二进制收缩 A[a,b] * B[b,c,d] = C[a,c,d]
。那么,为什么?
我的时间:
In [26]: A = np.random.random((90,80))
In [27]: B = np.random.random((80,81,82))
In [28]: timeit np.einsum('ab,bcd->acd',A,B,optimize=False)
39.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [29]: timeit np.einsum('ab,bcd->acd',A,B,optimize=True)
9.06 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
查看 einsum
代码,我明白了:
# If no optimization, run pure einsum
if optimize is False:
return c_einsum(*operands, **kwargs)
如果没有,它会检查各种参数,然后
operands, contraction_list = einsum_path(*operands, optimize=optimize,
einsum_call=True)
...
# Call tensordot if still possible
if blas:
...
new_view = tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)))
由于我们只有 2 个参数并且 path
很简单,我认为 True
的情况只是:
In [30]: timeit np.tensordot(A,B,(1,0))
7.62 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
根据过去对 tensordot
的研究:
In [31]: timeit (A@B.reshape(80,-1)).reshape(90,81,82)
6.44 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
所以基本上时间差在 运行 编译的“纯 einsum”和将其转换为 matmul
问题的替代方案之间,后者可以使用优化的 BLAS
例程. [29] 时间似乎是 [31] 时间加上一些开销。
在https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
optimize{False, True, ‘greedy’, ‘optimal’}, optional Controls if intermediate optimization should occur. No optimization will occur if False and True will default to the ‘greedy’ algorithm. Also accepts an explicit contraction list from the np.einsum_path function. See np.einsum_path for more details. Defaults to False.
在我看来optimize
标志是选择多个缩略语的顺序。例如,
A B C -> D
对于更快的 (AB)C 或 A(BC) 或 (AC)B,而不是二进制收缩,例如 AB->C
.
下面的代码为A[a,b] * B[b,c,d] = C[a,c,d]
import numpy as np
import time
import scipy.stats
# from
def mean_confidence_interval(data, var_name, unit, confidence=0.95):
a = 1.0 * np.array(data)
n = len(a)
m, se = np.mean(a), scipy.stats.sem(a)
h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
print(var_name, round(m, 5), "\u00B1", round(h, 5), unit )
def einsum_greedy(A, B, na, nb, nc, nd):
#res = np.zeros((na,nc,nd))
res = np.einsum('ab,bcd->acd', A, B, optimize="greedy")
return res
def einsum_standard(A, B, na, nb, nc, nd):
# res = np.zeros((na,nc,nd))
res = np.einsum('ab,bcd->acd', A, B)
return res
def btime_ABC(name_def, name_out, A, B, C, na, nb, nc, nd, n_times):
global opt_path
list_time = []
for i in range(n_times):
start_time = time.time()
C = name_def(A, B, na, nb, nc, nd)
finish_time = time.time()
list_time.append(finish_time - start_time)
mean_confidence_interval(list_time, name_out, 's' )
# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = dim_comm = 90
n_times = 60
print('number of common dimension', dim_comm)
print('number of averaged time', n_times)
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C1 = np.zeros((na,nc,nd))
C2 = np.zeros((na,nc,nd))
btime_ABC(einsum_standard, 'einsum_standard', A, B, C1, na, nb, nc, nd, n_times)
btime_ABC(einsum_greedy, 'einsum_greedy', A, B, C2, na, nb, nc, nd, n_times)
我得到了
number of common dimension 90
number of averaged time 60
einsum_standard 0.04799 ± 0.00312 s
einsum_greedy 0.00805 ± 0.00137 s
optimize
标志有助于二进制收缩 A[a,b] * B[b,c,d] = C[a,c,d]
。那么,为什么?
我的时间:
In [26]: A = np.random.random((90,80))
In [27]: B = np.random.random((80,81,82))
In [28]: timeit np.einsum('ab,bcd->acd',A,B,optimize=False)
39.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [29]: timeit np.einsum('ab,bcd->acd',A,B,optimize=True)
9.06 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
查看 einsum
代码,我明白了:
# If no optimization, run pure einsum
if optimize is False:
return c_einsum(*operands, **kwargs)
如果没有,它会检查各种参数,然后
operands, contraction_list = einsum_path(*operands, optimize=optimize,
einsum_call=True)
...
# Call tensordot if still possible
if blas:
...
new_view = tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)))
由于我们只有 2 个参数并且 path
很简单,我认为 True
的情况只是:
In [30]: timeit np.tensordot(A,B,(1,0))
7.62 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
根据过去对 tensordot
的研究:
In [31]: timeit (A@B.reshape(80,-1)).reshape(90,81,82)
6.44 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
所以基本上时间差在 运行 编译的“纯 einsum”和将其转换为 matmul
问题的替代方案之间,后者可以使用优化的 BLAS
例程. [29] 时间似乎是 [31] 时间加上一些开销。