Is there a specific use of pdist function of scipy for some particular indexes?

Question

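My question is about using the pdist function from scipy.spatial.distance. I need to compute the Hamming distance between one 1x64 vector and each of millions of other 1x64 vectors stored in a 2D array, but I can't use pdist for this, because it returns the Hamming distance between every pair of vectors within the same 2D array. Is there a way to make it compute the Hamming distance between the vector at a specific index and all the other vectors?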

Here is my current code. I'm using a 1000x64 array for now, because I get a memory error with big arrays.

import numpy as np
from scipy.spatial.distance import pdist


ph = np.load('little.npy')

print(pdist(ph, 'hamming').shape)

The output is

(499500,)

little.npy contains a 1000x64 array. For example, suppose I only want the Hamming distances between the 31st vector and all the others. What should I do?

Answer 1

You can use cdist. For example,

In [101]: from scipy.spatial.distance import cdist

In [102]: x
Out[102]: 
array([[0, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0],
       [1, 0, 1, 1, 0, 1, 1, 0],
       [1, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 0, 1, 1, 1, 0],
       [1, 0, 0, 1, 1, 0, 0, 1]])

In [103]: index = 3

In [104]: cdist(x[index:index+1], x, 'hamming')
Out[104]: 
array([[ 0.625,  0.375,  0.5  ,  0.   ,  0.125,  0.75 ,  0.375,  0.375,
         0.5  ,  0.625]])

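This gives the Hamming distance between the row at index 3 and all the rows (including the row at index 3 itself). The result is a 2D array with a single row. You might want to pull out that row right away, so the result is 1D: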

In [105]: cdist(x[index:index+1], x, 'hamming')[0]
Out[105]: 
array([ 0.625,  0.375,  0.5  ,  0.   ,  0.125,  0.75 ,  0.375,  0.375,
        0.5  ,  0.625])
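
As a cross-check, SciPy's 'hamming' metric is the fraction of positions at which two vectors differ, so the same values can be reproduced with plain NumPy. The sketch below rebuilds the 10x8 array shown above; the names `dist_np` and `others` are introduced here only for illustration. Note that the result holds one value per row, rather than the n(n-1)/2 entries that pdist returns.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Same 10x8 example array as in the session above.
x = np.array([[0, 1, 1, 1, 1, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 1, 0],
              [1, 0, 1, 1, 0, 1, 1, 1],
              [0, 1, 0, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 0, 1, 0, 0],
              [1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 0, 0, 1, 1, 1, 0],
              [1, 0, 0, 1, 1, 0, 0, 1]])
index = 3

# Distances from row `index` to every row, as a 1D array.
d = cdist(x[index:index+1], x, 'hamming')[0]

# Hamming distance by definition: fraction of mismatched positions per row.
dist_np = (x != x[index]).mean(axis=1)
assert np.allclose(d, dist_np)

# Drop the self-distance (always 0) if only the *other* rows are wanted.
others = np.delete(d, index)
print(others.shape)
```

If the self-distance is removed this way, position i in `others` no longer matches row i of `x` for i >= index, so keep the full array when the row indices matter.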

I used x[index:index+1] instead of x[index] so that the input is a 2D array (with just one row):

In [106]: x[index:index+1]
Out[106]: array([[1, 0, 1, 1, 0, 1, 1, 0]])

If you use x[index], you'll get an error, because x[index] is a 1D array and cdist requires both of its inputs to be 2D.
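
Other ways of producing a one-row 2D array should work equally well. Here is a small sketch, assuming a random 0/1 array as a stand-in for the 1000x64 data in little.npy (the random data, the index 31, and the variable names are all assumptions made for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical stand-in for the real data: 1000 random binary vectors of length 64.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(1000, 64))
index = 31  # arbitrary row chosen for the example

# Three equivalent ways to pass a single row as a 2D array of shape (1, 64).
a = cdist(x[index:index+1], x, 'hamming')[0]        # slice
b = cdist(x[[index]], x, 'hamming')[0]              # list index
c = cdist(x[index][np.newaxis, :], x, 'hamming')[0] # add a leading axis
assert np.array_equal(a, b) and np.array_equal(a, c)
print(a.shape)

# Passing the 1D row directly is rejected by cdist.
try:
    cdist(x[index], x, 'hamming')
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

Each call returns one distance per row of x, so the output grows linearly with the number of vectors.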

Tags: python, scipy, pdist