Scipy - Nan when calculating Mahalanobis distance
When I try to compute the Mahalanobis distance with the following Python code, I get some NaN entries in the result. Do you know why this happens?
My data.shape = (181, 1500)
import numpy as np
from scipy.spatial.distance import pdist, squareform
data_log = np.log2(data + 1) # A log transform that I usually apply to my data
data_centered = data_log - data_log.mean(0) # zero centering
D = squareform( pdist(data_centered, 'mahalanobis' ) )
I also tried:
data_standard = data_centered / data_centered.std(0, ddof=1)
D = squareform( pdist(data_standard, 'mahalanobis' ) )
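which also gives NaNs.
The input is not corrupted, and other distances, such as the correlation distance, are computed just fine.
For some reason, when I reduce the number of features, I no longer get NaNs. For example, the following calls do not produce any NaN: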
D = squareform( pdist(data_centered[:,:200], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:480], 'mahalanobis' ) )
while others do produce NaNs:
D = squareform( pdist(data_centered[:,:300], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:600], 'mahalanobis' ) )
Any clues? Is this expected behavior when some condition on the input is not met?
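You have fewer observations than features, so the covariance matrix V computed by the scipy code is singular. The code does not check this, and blindly computes the "inverse" of the covariance matrix. Because this numerically computed inverse is basically garbage, the product (x-y)*inv(V)*(x-y) (where x and y are observations) might turn out to be negative. The square root of that value then results in nan.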
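The singularity can be checked directly from the shape of the data. The sketch below is not from the original answer; it uses a small synthetic array (an assumption standing in for the real data) with fewer observations than features. Because centering removes one degree of freedom, the rank of the covariance matrix is at most the number of observations minus one. The same bound applies to the 200-feature slices in the question (200 > 180), so nan-free output there does not make those distances trustworthy.

```python
import numpy as np

# Synthetic stand-in for the real data (assumption): 20 observations, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))

n_obs, n_features = X.shape
V = np.cov(X.T)                      # 50 x 50 covariance matrix
rank = np.linalg.matrix_rank(V)      # numerical rank of V

# rank(V) <= n_obs - 1, which is less than n_features here, so V is singular.
is_singular = rank < n_features
print(n_obs, n_features, rank, is_singular)
```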
For example, this array also produces nan:
In [265]: x
Out[265]:
array([[-1. , 0.5, 1. , 2. , 2. ],
[ 2. , 1. , 2.5, -1.5, 1. ],
[ 1.5, -0.5, 1. , 2. , 2.5]])
In [266]: squareform(pdist(x, 'mahalanobis'))
Out[266]:
array([[ 0. , nan, 1.90394328],
[ nan, 0. , nan],
[ 1.90394328, nan, 0. ]])
Here is the Mahalanobis calculation done "by hand":
In [279]: V = np.cov(x.T)
In theory, V is singular; the following value is effectively 0:
In [280]: np.linalg.det(V)
Out[280]: -2.968550671342364e-47
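The determinant is a scale-dependent test for singularity, so as a complementary check (an addition, not part of the original answer) one might look at the numerical rank and the condition number of V. With 3 observations, the rank of V should be at most 2 out of 5, and the condition number should be enormous.

```python
import numpy as np

x = np.array([[-1.0,  0.5, 1.0,  2.0, 2.0],
              [ 2.0,  1.0, 2.5, -1.5, 1.0],
              [ 1.5, -0.5, 1.0,  2.0, 2.5]])
V = np.cov(x.T)

rank = np.linalg.matrix_rank(V)   # numerical rank from the singular values
cond = np.linalg.cond(V)          # ratio of largest to smallest singular value

print("rank:", rank, "of", V.shape[0])
print("condition number:", cond)
```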
But inv does not detect the problem, and returns an inverse:
In [281]: VI = np.linalg.inv(V)
Let's compute the distance between x[0] and x[2], and verify that using VI gives the same non-nan value (1.9039) that pdist returned:
In [295]: delta = x[0] - x[2]
In [296]: np.dot(np.dot(delta, VI), delta)
Out[296]: 3.625
In [297]: np.sqrt(np.dot(np.dot(delta, VI), delta))
Out[297]: 1.9039432764659772
Here is what happens when we try to compute the distance between x[0] and x[1]:
In [300]: delta = x[0] - x[1]
In [301]: np.dot(np.dot(delta, VI), delta)
Out[301]: -1.75
The square root of that value then gives nan.
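One possible workaround, which the original answer does not propose, is to pass a pseudo-inverse to pdist through its VI argument instead of letting it invert V. The sketch below assumes np.linalg.pinv with an explicit rcond cutoff (the value 1e-10 is an arbitrary choice) to discard the near-zero singular values. The result is free of nan, but it should be read with care: when there are fewer observations than features, the pseudo-inverse only measures distance inside the subspace spanned by the data, and for this 3-observation example every pairwise distance comes out as 2, so the values carry no information about which points are closer.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([[-1.0,  0.5, 1.0,  2.0, 2.0],
              [ 2.0,  1.0, 2.5, -1.5, 1.0],
              [ 1.5, -0.5, 1.0,  2.0, 2.5]])
V = np.cov(x.T)

# Pseudo-inverse instead of inv; rcond drops singular values that are
# numerically zero (the cutoff 1e-10 is an assumption, not a recommendation).
VI = np.linalg.pinv(V, rcond=1e-10)

D = squareform(pdist(x, 'mahalanobis', VI=VI))
print(D)
```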
In scipy 0.16 (due for release in June 2015), you will get an error instead of nan or garbage. The error message describes the problem:
In [4]: x = array([[-1. , 0.5, 1. , 2. , 2. ],
...: [ 2. , 1. , 2.5, -1.5, 1. ],
...: [ 1.5, -0.5, 1. , 2. , 2.5]])
In [5]: pdist(x, 'mahalanobis')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-a3453ff6fe48> in <module>()
----> 1 pdist(x, 'mahalanobis')
/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1298 "singular. For observations with %d "
1299 "dimensions, at least %d observations "
-> 1300 "are required." % (m, n, n + 1))
1301 V = np.atleast_2d(np.cov(X.T))
1302 VI = _convert_to_double(np.linalg.inv(V).T.copy())
ValueError: The number of observations (3) is too small; the covariance matrix is singular. For observations with 5 dimensions, at least 6 observations are required.
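On versions that lack this check, a similar guard can be applied before calling pdist. The helper below is a hypothetical sketch (the name mahalanobis_pdist_checked is made up for illustration) that applies the same condition as the error message above: for n features, at least n + 1 observations are required.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mahalanobis_pdist_checked(X):
    """Hypothetical helper: refuse to run when the covariance must be singular."""
    X = np.asarray(X, dtype=float)
    n_obs, n_features = X.shape
    if n_obs <= n_features:
        raise ValueError(
            "%d observations with %d features: at least %d observations "
            "are required" % (n_obs, n_features, n_features + 1))
    # Invert the covariance explicitly and hand it to pdist via VI.
    VI = np.linalg.inv(np.atleast_2d(np.cov(X.T)))
    return pdist(X, 'mahalanobis', VI=VI)

# Usage: 10 observations of 3 features is enough; 3 observations of 5 is not.
rng = np.random.default_rng(0)
d_ok = mahalanobis_pdist_checked(rng.normal(size=(10, 3)))
try:
    mahalanobis_pdist_checked(np.zeros((3, 5)))
    raised = False
except ValueError:
    raised = True
```

Passing this check only rules out the case where the covariance matrix is guaranteed to be singular; strongly correlated features can still make it ill-conditioned.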