有效地更新点之间的距离
Efficiently updating distance between points
我有一个包含 n 行(观察)和 p 列(特征)的数据集:
import numpy as np
from scipy.spatial.distance import pdist, squareform
p = 3
n = 5
xOld = np.random.rand(n * p).reshape([n, p])
我有兴趣获得 nxn
矩阵中这些点之间的距离,该矩阵确实具有 n x (n-1)/2
个唯一值
sq_dists = pdist(xOld, 'sqeuclidean')
D_n = squareform(sq_dists)
现在假设我得到 N
额外的观察结果并想更新 D_n
。一种非常低效的方法是:
N = 3
xNew = np.random.rand(N * p).reshape([N, p])
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
但是,考虑到 n ~ 10000 和 N ~ 100,这将是多余的。我的目标是使用 D_n
更有效地获得 D_n_N
。为此,我将 D_n_N 划分如下。我已经 D_n
并且可以计算 B [N x N]
。但是,我想知道是否有一种不用大量for循环来计算A(或A转置)并最终构造D_n_N
的好方法
D_n (n x n) A [n x N]
A.T [N x n] B [N x N]
提前致谢。
很有趣的问题!好吧,在获得解决方案的过程中,我在这里学到了一些新东西。
涉及的步骤:
首先,我们在这里介绍新的积分。因此,我们需要使用 cdist
来获得新旧点之间的平方欧氏距离。这些将被容纳在新输出的两个块中,一个位于旧距离的正下方,一个位于那些旧距离的右侧。
我们还需要计算新点中的pdist
,并将其square-formed
块放在新对角线区域的尾部。
新 D_n_N
的示意图如下所示:
[ D_n cdist.T
cdist New pdist squarefomed]
总而言之,实现看起来是这样的 -
cdists = cdist( xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
运行时测试
接近 -
# Original approach
def org_app(D_n, xNew):
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
return D_n_N
# Proposed approach
def proposed_app(D_n, xNew, N):
cdists = cdist( xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
return out
计时 -
In [102]: # Setup inputs
...: p = 3
...: n = 5000
...: xOld = np.random.rand(n * p).reshape([n, p])
...:
...: sq_dists = pdist(xOld, 'sqeuclidean')
...: D_n = squareform(sq_dists)
...:
...: N = 3000
...: xNew = np.random.rand(N * p).reshape([N, p])
...:
In [103]: np.allclose( proposed_app(D_n, xNew, N), org_app(D_n, xNew))
Out[103]: True
In [104]: %timeit org_app(D_n, xNew)
1 loops, best of 3: 541 ms per loop
In [105]: %timeit proposed_app(D_n, xNew, N)
1 loops, best of 3: 201 ms per loop
只需使用 cdist :
D_OO=cdist(xOld,xOld)
D_NN=cdist(xNew,xNew)
D_NO=cdist(xNew,xOld)
D_ON=cdist(xOld,xNew) # or D_NO.T
最后:
D_=vstack((hstack((D_OO,D_ON)),(hstack((D_NO,D_NN)))))
我有一个包含 n 行(观察)和 p 列(特征)的数据集:
import numpy as np
from scipy.spatial.distance import pdist, squareform
p = 3
n = 5
xOld = np.random.rand(n * p).reshape([n, p])
我有兴趣获得 nxn
矩阵中这些点之间的距离,该矩阵确实具有 n x (n-1)/2
个唯一值
sq_dists = pdist(xOld, 'sqeuclidean')
D_n = squareform(sq_dists)
现在假设我得到 N
额外的观察结果并想更新 D_n
。一种非常低效的方法是:
N = 3
xNew = np.random.rand(N * p).reshape([N, p])
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
但是,考虑到 n ~ 10000 和 N ~ 100,这将是多余的。我的目标是使用 D_n
更有效地获得 D_n_N
。为此,我将 D_n_N 划分如下。我已经 D_n
并且可以计算 B [N x N]
。但是,我想知道是否有一种不用大量for循环来计算A(或A转置)并最终构造D_n_N
D_n (n x n) A [n x N]
A.T [N x n] B [N x N]
提前致谢。
很有趣的问题!好吧,在获得解决方案的过程中,我在这里学到了一些新东西。
涉及的步骤:
首先,我们在这里介绍新的积分。因此,我们需要使用
cdist
来获得新旧点之间的平方欧氏距离。这些将被容纳在新输出的两个块中,一个位于旧距离的正下方,一个位于那些旧距离的右侧。我们还需要计算新点中的
pdist
,并将其square-formed
块放在新对角线区域的尾部。
新 D_n_N
的示意图如下所示:
[ D_n cdist.T
cdist New pdist squarefomed]
总而言之,实现看起来是这样的 -
cdists = cdist( xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
运行时测试
接近 -
# Original approach
def org_app(D_n, xNew):
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
return D_n_N
# Proposed approach
def proposed_app(D_n, xNew, N):
cdists = cdist( xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
return out
计时 -
In [102]: # Setup inputs
...: p = 3
...: n = 5000
...: xOld = np.random.rand(n * p).reshape([n, p])
...:
...: sq_dists = pdist(xOld, 'sqeuclidean')
...: D_n = squareform(sq_dists)
...:
...: N = 3000
...: xNew = np.random.rand(N * p).reshape([N, p])
...:
In [103]: np.allclose( proposed_app(D_n, xNew, N), org_app(D_n, xNew))
Out[103]: True
In [104]: %timeit org_app(D_n, xNew)
1 loops, best of 3: 541 ms per loop
In [105]: %timeit proposed_app(D_n, xNew, N)
1 loops, best of 3: 201 ms per loop
只需使用 cdist :
D_OO=cdist(xOld,xOld)
D_NN=cdist(xNew,xNew)
D_NO=cdist(xNew,xOld)
D_ON=cdist(xOld,xNew) # or D_NO.T
最后:
D_=vstack((hstack((D_OO,D_ON)),(hstack((D_NO,D_NN)))))