Row sum of a dot product for huge matrices in Python
I have two matrices, 100k×200 and 200×100k. If they were small matrices I would just use the NumPy dot product:
sum(a.dot(b), axis = 0)
But the matrices are too big, and I can't use loops. Is there a smart way to do this?
There are two sum-reductions taking place here - one from the matrix multiplication with np.dot, followed by the explicit sum. We can use np.einsum to do both in one go, like so -
np.einsum('ij,jk->k',a,b)
Sample run -
In [27]: a = np.random.rand(3,4)
In [28]: b = np.random.rand(4,3)
In [29]: np.sum(a.dot(b), axis = 0)
Out[29]: array([ 2.70084316, 3.07448582, 3.28690401])
In [30]: np.einsum('ij,jk->k',a,b)
Out[30]: array([ 2.70084316, 3.07448582, 3.28690401])
Runtime test -
In [45]: a = np.random.rand(1000,200)
In [46]: b = np.random.rand(200,1000)
In [47]: %timeit np.sum(a.dot(b), axis = 0)
100 loops, best of 3: 5.5 ms per loop
In [48]: %timeit np.einsum('ij,jk->k',a,b)
10 loops, best of 3: 71.8 ms per loop
Sadly, it looks like we aren't doing any better with np.einsum.
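One thing worth trying (this assumes NumPy 1.12 or newer, which is not stated in the question): np.einsum accepts an optimize flag that lets it dispatch to BLAS-backed contraction paths, which can close some of the gap seen above. A minimal sketch:

```python
import numpy as np

a = np.random.rand(1000, 200)
b = np.random.rand(200, 1000)

# Same reduction two ways; optimize=True lets einsum pick a faster
# contraction order / BLAS path where available (NumPy 1.12+).
r1 = np.sum(a.dot(b), axis=0)
r2 = np.einsum('ij,jk->k', a, b, optimize=True)
print(np.allclose(r1, r2))  # the two reductions agree
```

Whether this is actually faster depends on the NumPy version and BLAS build, so it is worth timing on your own machine.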
To change to np.sum(a.dot(b), axis = 1) instead, simply swap the output subscript in the string - np.einsum('ij,jk->i',a,b) - like so -
In [42]: np.sum(a.dot(b), axis = 1)
Out[42]: array([ 3.97805141, 3.2249661 , 1.85921549])
In [43]: np.einsum('ij,jk->i',a,b)
Out[43]: array([ 3.97805141, 3.2249661 , 1.85921549])
A possible optimization is
>>> numpy.sum(a @ b, axis=0)
array([ 1.83633615, 18.71643672, 15.26981078, -46.33670382, 13.30276476])
>>> numpy.sum(a, axis=0) @ b
array([ 1.83633615, 18.71643672, 15.26981078, -46.33670382, 13.30276476])
Computing a @ b takes 100k×200×100k operations, whereas summing the rows first reduces the multiplication to 1×200×100k operations, a 100k× improvement.
This mostly comes from recognizing that
numpy.sum(x, axis=0) == [1, 1, ..., 1] @ x
=> numpy.sum(a @ b, axis=0) == [1, 1, ..., 1] @ (a @ b)
== ([1, 1, ..., 1] @ a) @ b
== numpy.sum(a, axis=0) @ b
It's similar for the other axis.
>>> numpy.sum(a @ b, axis=1)
array([ 2.8794171 , 9.12128399, 14.52009991, -8.70177811, -15.0303783 ])
>>> a @ numpy.sum(b, axis=1)
array([ 2.8794171 , 9.12128399, 14.52009991, -8.70177811, -15.0303783 ])
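The two identities above can be checked directly on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 4))
b = rng.standard_normal((4, 3))

# axis=0: summing the rows of a @ b == (row-sum of a) @ b
print(np.allclose(np.sum(a @ b, axis=0), np.sum(a, axis=0) @ b))
# axis=1: summing along rows of a @ b == a @ (row-sum of b)
print(np.allclose(np.sum(a @ b, axis=1), a @ np.sum(b, axis=1)))
```

Both checks print True: in each case the cheap side never forms the full product.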
(Note: x @ y is equivalent to x.dot(y) on Python 3.5+ with NumPy 1.10.0+.)
$ INITIALIZATION='import numpy;numpy.random.seed(0);a=numpy.random.randn(1000,200);b=numpy.random.rand(200,1000)'
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.einsum("ij,jk->k", a, b)'
10 loops, best of 3: 87.2 msec per loop
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.sum(a@b, axis=0)'
100 loops, best of 3: 12.8 msec per loop
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.sum(a, axis=0)@b'
1000 loops, best of 3: 300 usec per loop
Illustration:
In [235]: a = np.random.rand(3,3)

In [236]: a
Out[236]: 
array([[ 0.465,  0.758,  0.641],
       [ 0.897,  0.673,  0.742],
       [ 0.763,  0.274,  0.485]])

In [237]: b = np.random.rand(3,2)

In [238]: b
Out[238]: 
array([[ 0.303,  0.378],
       [ 0.039,  0.095],
       [ 0.192,  0.668]])
Now, if we simply do a @ b, we would need 18 multiplications and 6 additions. Doing np.sum(a, axis=0) @ b instead needs only 6 multiplications and 2 additions: a 3× improvement, because a has 3 rows. For the OP's case this should be a 100k× improvement over the plain a @ b computation, since a has 100k rows.
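To make the counting concrete: for a of shape (m, k) and b of shape (k, n), a naive a @ b costs m·k·n multiplications, while multiplying by the row-sum vector costs only k·n. A tiny helper (hypothetical, written just for this illustration) shows the ratio for the OP's sizes:

```python
def matmul_mults(m, k, n):
    """Multiplications in a naive (m, k) @ (k, n) matrix product."""
    return m * k * n

# OP's case: a is 100k x 200, b is 200 x 100k
plain = matmul_mults(100_000, 200, 100_000)   # full a @ b
summed = matmul_mults(1, 200, 100_000)        # (row-sum of a) @ b
print(plain // summed)  # -> 100000, i.e. the saving equals the row count m
```

The saving is exactly m, the number of rows collapsed by the early sum.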
Some quick tests using ideas I added to Divakar's answer:
In [162]: a = np.random.rand(1000,200)
In [163]: b = np.random.rand(200,1000)
In [174]: timeit c1=np.sum(a.dot(b), axis=0)
10 loops, best of 3: 27.7 ms per loop
In [175]: timeit c2=np.sum(a,axis=0).dot(b)
1000 loops, best of 3: 432 µs per loop
In [176]: timeit c3=np.einsum('ij,jk->k',a,b)
10 loops, best of 3: 170 ms per loop
In [177]: timeit c4=np.einsum('j,jk->k', np.einsum('ij->j', a), b)
1000 loops, best of 3: 353 µs per loop
In [178]: timeit np.einsum('ij->j', a) @b
1000 loops, best of 3: 304 µs per loop
einsum is actually faster than np.sum!
In [180]: timeit np.einsum('ij->j', a)
1000 loops, best of 3: 173 µs per loop
In [181]: timeit np.sum(a,0)
1000 loops, best of 3: 312 µs per loop
For larger arrays the einsum advantage diminishes:
In [183]: a = np.random.rand(100000,200)
In [184]: b = np.random.rand(200,100000)
In [185]: timeit np.einsum('ij->j', a) @b
10 loops, best of 3: 51.5 ms per loop
In [186]: timeit c2=np.sum(a,axis=0).dot(b)
10 loops, best of 3: 59.5 ms per loop
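Putting it together for the OP's problem: both row sums can be computed without ever materializing the huge 100k×100k product. A sketch with the sizes trimmed by 10× so it runs quickly:

```python
import numpy as np

a = np.random.rand(10_000, 200)  # stand-in for the 100k x 200 matrix
b = np.random.rand(200, 10_000)  # stand-in for the 200 x 100k matrix

# Sum before multiplying: only small intermediates are ever created.
col_totals = np.sum(a, axis=0) @ b  # == np.sum(a @ b, axis=0)
row_totals = a @ np.sum(b, axis=1)  # == np.sum(a @ b, axis=1)
print(col_totals.shape, row_totals.shape)  # (10000,) (10000,)
```

At the full 100k sizes, a @ b alone would need ~80 GB as float64, so folding the sum into the multiplication is what makes the problem fit in memory at all.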