如何使用 numpy 找到两个非常大的矩阵的行之间的成对差异?
How to find the pairwise differences between rows of two very large matrices using numpy?
给定两个矩阵,我想计算所有行之间的成对差异。每个矩阵有 1000 行和 100 列,因此它们相当大。我尝试使用 for 循环和纯广播,但 for 循环似乎工作得更快。难道我做错了什么?这是代码:
from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
广播方法需要大约 1 秒的时间,对于大矩阵来说甚至更长。知道如何纯粹使用 numpy 来加快速度吗?
这是一个避免循环和大中间体的解决方案:
from numpy import *
import time
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
#matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start
打印:
0.546781778336
0.674743175507
0.10723400116
np.einsum
的另一份工作
np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])
这是另一种执行方式:
(a-b)^2 = a^2 + b^2 - 2ab
前两项为 np.einsum
,第三项为 dot-product
-
import numpy as np
np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)
运行时测试
接近-
def loopy_app(A,B):
m,n = A.shape[0], B.shape[0]
out = np.empty((m,n))
for i,a in enumerate(A):
out[i] = np.sum((a - B)**2,1)
return out
def broadcasting_app(A,B):
return ((A[:,np.newaxis,:] - B)**2).sum(-1)
# @Paul Panzer's soln
def outer_sum_dot_app(A,B):
return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)
# @Daniel Forsman's soln
def einsum_all_app(A,B):
return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
A[:,None,:] - B[None,:,:])
# Proposed in this post
def outer_einsum_dot_app(A,B):
return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
2*np.dot(A,B.T)
计时 -
In [51]: A = np.random.randn(1000,100)
...: B = np.random.randn(1000,100)
...:
In [52]: %timeit loopy_app(A,B)
...: %timeit broadcasting_app(A,B)
...: %timeit outer_sum_dot_app(A,B)
...: %timeit einsum_all_app(A,B)
...: %timeit outer_einsum_dot_app(A,B)
...:
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop
类似于@paul-panzer,计算任意维度数组的成对差异的一般方法是广播如下:
设 v 为大小为 (n, d) 的 NumPy 数组:
import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down
为了更好地了解正在发生的事情,想象一下 v 的每个 d 维行都像旗杆一样被支撑起来,并上下复制。现在,当您进行分量减法时,您将得到每个成对组合。
__
还有 scipy.spatial.distance.pdist,它以成对方式计算指标。
from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))
给定两个矩阵,我想计算所有行之间的成对差异。每个矩阵有 1000 行和 100 列,因此它们相当大。我尝试使用 for 循环和纯广播,但 for 循环似乎工作得更快。难道我做错了什么?这是代码:
from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
广播方法需要大约 1 秒的时间,对于大矩阵来说甚至更长。知道如何纯粹使用 numpy 来加快速度吗?
这是一个避免循环和大中间体的解决方案:
from numpy import *
import time
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
#matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start
打印:
0.546781778336
0.674743175507
0.10723400116
np.einsum
np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])
这是另一种执行方式:
(a-b)^2 = a^2 + b^2 - 2ab
前两项为 np.einsum
,第三项为 dot-product
-
import numpy as np
np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)
运行时测试
接近-
def loopy_app(A,B):
m,n = A.shape[0], B.shape[0]
out = np.empty((m,n))
for i,a in enumerate(A):
out[i] = np.sum((a - B)**2,1)
return out
def broadcasting_app(A,B):
return ((A[:,np.newaxis,:] - B)**2).sum(-1)
# @Paul Panzer's soln
def outer_sum_dot_app(A,B):
return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)
# @Daniel Forsman's soln
def einsum_all_app(A,B):
return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
A[:,None,:] - B[None,:,:])
# Proposed in this post
def outer_einsum_dot_app(A,B):
return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
2*np.dot(A,B.T)
计时 -
In [51]: A = np.random.randn(1000,100)
...: B = np.random.randn(1000,100)
...:
In [52]: %timeit loopy_app(A,B)
...: %timeit broadcasting_app(A,B)
...: %timeit outer_sum_dot_app(A,B)
...: %timeit einsum_all_app(A,B)
...: %timeit outer_einsum_dot_app(A,B)
...:
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop
类似于@paul-panzer,计算任意维度数组的成对差异的一般方法是广播如下:
设 v 为大小为 (n, d) 的 NumPy 数组:
import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down
为了更好地了解正在发生的事情,想象一下 v 的每个 d 维行都像旗杆一样被支撑起来,并上下复制。现在,当您进行分量减法时,您将得到每个成对组合。
__
还有 scipy.spatial.distance.pdist,它以成对方式计算指标。
from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))