Difference between scipy pairwise distance and X.X+Y.Y - X.Y^t
Suppose we have the following data:
d1 = np.random.uniform(low=0, high=2, size=(3,2))
d2 = np.random.uniform(low=3, high=5, size=(3,2))
X = np.vstack((d1,d2))
X
array([[ 1.4930674 , 1.64890721],
[ 0.40456265, 0.62262546],
[ 0.86893397, 1.3590808 ],
[ 4.04177045, 4.40938126],
[ 3.01396153, 4.60005842],
[ 3.2144552 , 4.65539323]])
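I want to compare two ways of generating pairwise distances.
Assuming X and Y are the same:
(X-Y)^2 = X.X + Y.Y - 2*X.Y^t
This is the first method, since it is what scikit-learn uses to compute pairwise distances, which are later used for the kernel matrix.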
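import numpy as np
def cal_pdist1(X):
    Y = X
    # Squared norm of every row, as a row vector of shape (1, n)
    XX = np.einsum('ij,ij->i', X, X)[np.newaxis, :]
    # The same squared norms as a column vector of shape (n, 1)
    YY = XX.T
    # Cross term -2 * X.Y^t, then broadcast-add both sets of squared norms
    distances = -2 * np.dot(X, Y.T)
    distances += XX
    distances += YY
    return distances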
cal_pdist1(X)
array([[ 0. , 2.2380968 , 0.47354188, 14.11610424,
11.02241244, 12.00213414],
[ 2.2380968 , 0. , 0.75800718, 27.56880003,
22.62893544, 24.15871196],
[ 0.47354188, 0.75800718, 0. , 19.37122424,
15.1050792 , 16.36714548],
[ 14.11610424, 27.56880003, 19.37122424, 0. ,
1.09274896, 0.74497242],
[ 11.02241244, 22.62893544, 15.1050792 , 1.09274896,
0. , 0.04325965],
[ 12.00213414, 24.15871196, 16.36714548, 0.74497242,
0.04325965, 0. ]])
Now, if I use the scipy pairwise distance function, I get
import scipy, scipy.spatial
pd_sparse = scipy.spatial.distance.pdist(X, metric='seuclidean')
scipy.spatial.distance.squareform(pd_sparse)
array([[ 0. , 0.92916653, 0.45646989, 2.29444795, 1.89740167,
2.00059442],
[ 0.92916653, 0. , 0.50798432, 3.22211357, 2.78788236,
2.90062103],
[ 0.45646989, 0.50798432, 0. , 2.72720831, 2.28001564,
2.39338343],
[ 2.29444795, 3.22211357, 2.72720831, 0. , 0.71411943,
0.58399694],
[ 1.89740167, 2.78788236, 2.28001564, 0.71411943, 0. ,
0.14102567],
[ 2.00059442, 2.90062103, 2.39338343, 0.58399694, 0.14102567,
0. ]])
The results are completely different! Shouldn't they be the same?
pdist(..., metric='seuclidean') computes the standardized Euclidean distance, not the squared Euclidean distance (which is what cal_pdist1 returns).
From the docs:
Y = pdist(X, 'seuclidean', V=None)
Computes the standardized Euclidean distance. The standardized Euclidean distance between two n-vectors u and v is
$$\sqrt{\sum_i (u_i - v_i)^2 / V[x_i]}$$
V is the variance vector; V[i] is the variance computed over all the i'th components of the points. If not passed, it is automatically computed.
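To make the quoted definition concrete, here is a minimal sketch that computes the standardized Euclidean distance by hand and compares it with pdist. The data, the random seed, and the names `manual`, `auto`, and `explicit` are illustrative only; the sketch assumes the automatically computed V is the per-column sample variance (ddof=1), which is what the scipy documentation states as the default.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative data, not the array from the question
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(6, 2))

# Per-column sample variance (assumed to match pdist's default V)
V = X.var(axis=0, ddof=1)

# Manual standardized Euclidean distance: sqrt(sum_i (u_i - v_i)^2 / V[i])
diff = X[:, None, :] - X[None, :, :]
manual = np.sqrt((diff ** 2 / V).sum(axis=-1))

# pdist with V computed automatically, and with V passed explicitly
auto = squareform(pdist(X, metric='seuclidean'))
explicit = squareform(pdist(X, metric='seuclidean', V=V))

print(np.allclose(manual, auto), np.allclose(manual, explicit))
```

Each coordinate difference is divided by that coordinate's variance before summing, so the result is on a different scale from the plain squared distances.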
Try passing metric='sqeuclidean' instead, and you will see that the two functions return the same result to within rounding error.
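A quick way to check this is sketched below. It reuses cal_pdist1 from the question; the seeded random data and the names `expanded` and `direct` are assumptions made for reproducibility, not the values shown in the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cal_pdist1(X):
    # Squared Euclidean distances via X.X + Y.Y - 2*X.Y^t, with Y = X
    XX = np.einsum('ij,ij->i', X, X)[np.newaxis, :]
    distances = -2 * np.dot(X, X.T)
    distances += XX
    distances += XX.T
    return distances

# Two clusters, generated the same way as in the question but seeded
rng = np.random.default_rng(0)
d1 = rng.uniform(low=0, high=2, size=(3, 2))
d2 = rng.uniform(low=3, high=5, size=(3, 2))
X = np.vstack((d1, d2))

expanded = cal_pdist1(X)
direct = squareform(pdist(X, metric='sqeuclidean'))

# The two agree up to floating-point rounding
print(np.allclose(expanded, direct))
```

The expanded form can leave tiny nonzero or slightly negative values where the true distance is zero, so a tolerance-based comparison such as np.allclose is more appropriate than exact equality.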