Memory efficient mean pairwise distance

Question

I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.

>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626

In the example above, y becomes quite large (almost 2,500 times as large as the input array):

>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793

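But since I am only interested in the mean pairwise distance, the distance matrix does not have to be kept in memory. Instead, the mean of each row (or column) can be computed separately, and the final mean can then be computed from the row means.

Is there already a function that exploits this property, or is there an easy way to extend/combine existing functions to do so?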

Answer 1

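If you use the squared version of the distance, the mean is equivalent to the variance computed with n-1 (ddof=1): it equals twice the sum of the per-column sample variances. Note that this gives the mean squared distance, not the mean of the unsquared distances.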

from scipy.spatial.distance import pdist
import numpy as np
x = np.random.rand(10000, 2)
# mean of all squared Euclidean pairwise distances
print(pdist(x, 'sqeuclidean').mean())
# twice the sum of the per-column sample variances (ddof=1)
print(np.var(x, 0, ddof=1).sum()*2)

Output:
0.331474285845873
0.33147428584587346
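
The agreement of the two printed numbers is not a coincidence. A short derivation, with notation introduced here for illustration ($x_1, \dots, x_n \in \mathbb{R}^d$ are the rows of x, $\bar{x}$ is their mean, and $s_k^2$ is the sample variance of column $k$ with ddof=1), could look like this:

```latex
\begin{align*}
\sum_{i,j=1}^{n} \lVert x_i - x_j \rVert^2
  &= \sum_{i,j=1}^{n} \left( \lVert x_i \rVert^2 + \lVert x_j \rVert^2
     - 2\, x_i \cdot x_j \right) \\
  &= 2n \sum_{i=1}^{n} \lVert x_i \rVert^2 - 2 n^2 \lVert \bar{x} \rVert^2
   = 2n \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 , \\[4pt]
\sum_{i<j} \lVert x_i - x_j \rVert^2
  &= \frac{1}{2} \sum_{i,j=1}^{n} \lVert x_i - x_j \rVert^2
   = n \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 , \\[4pt]
\frac{2}{n(n-1)} \sum_{i<j} \lVert x_i - x_j \rVert^2
  &= \frac{2}{n-1} \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2
   = 2 \sum_{k=1}^{d} s_k^2 .
\end{align*}
```

The left-hand side of the last line is what pdist(x, 'sqeuclidean').mean() computes, and the right-hand side is np.var(x, 0, ddof=1).sum()*2.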

Answer 2

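You have to weight each row by the number of observations that make up its mean. For example, the pdist of a 3 x 2 matrix is the flattened upper triangle (with an offset of 1) of the square 3 x 3 distance matrix.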

>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> arr = np.arange(6).reshape(3,2)
>>> arr
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])
>>> from sklearn.metrics import pairwise_distances
>>> square = pairwise_distances(arr)
>>> square
array([[0.        , 2.82842712, 5.65685425],
       [2.82842712, 0.        , 2.82842712],
       [5.65685425, 2.82842712, 0.        ]])
>>> square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
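
To see why the weighting matters, the same 3 x 3 example can be worked through row by row. This is only an illustrative sketch: the variable names (row_means, row_weights, weighted, plain) are made up here, and squareform is used just to rebuild the square matrix for the demonstration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.arange(6).reshape(3, 2)
condensed = pdist(arr)            # flattened upper triangle
square = squareform(condensed)    # full symmetric 3 x 3 matrix
n = square.shape[0]
tot = n * (n - 1) / 2             # number of distinct pairs

# Mean of the upper-triangle part of each row (the last row has no entries).
row_means = [square[r, r + 1:].mean() for r in range(n - 1)]
# Each row contributes (n - 1 - r) of the tot pairwise distances.
row_weights = [(n - 1 - r) / tot for r in range(n - 1)]

weighted = float(np.dot(row_means, row_weights))  # matches condensed.mean()
plain = float(np.mean(row_means))                 # unweighted, does not match

print(weighted, plain, condensed.mean())
```

The weighted combination reproduces pdist(arr).mean(), while the plain average of the row means is biased because the rows of the upper triangle have different lengths.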

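There is a pairwise_distances_chunked function that can be used to iterate over the distance matrix row by row, but you need to keep track of the row index so that you only average the values in the upper/lower triangle of the matrix (the distance matrix is symmetric). This isn't complicated, but I would expect it to introduce a significant slowdown.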

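from sklearn.metrics import pairwise_distances_chunked

n = arr.shape[0]
tot = n * (n - 1) / 2  # number of distinct pairs
# working_memory=0 forces one row per chunk (sklearn may emit a warning)
gen = pairwise_distances_chunked(arr, working_memory=0)
weighted_means = 0.0
for r, chunk in enumerate(gen):
    upper = chunk[0, r + 1:]  # distances from row r to later rows only
    if upper.size:  # the last row has no upper-triangle entries
        weighted_means += upper.mean() * (upper.size / tot)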
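
As an alternative that stays within SciPy, the same idea can be applied to blocks of rows instead of single rows, which should reduce the Python loop overhead. The following is a minimal sketch, not a library function: mean_pairwise_distance and its block_size parameter are hypothetical names chosen for this example, and it assumes the metric is symmetric so that only pairs with i < j need to be summed.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist


def mean_pairwise_distance(x, block_size=1024, metric="euclidean"):
    """Mean of all pairwise distances without storing them all at once.

    Hypothetical helper: handles block_size rows at a time, so the extra
    memory is on the order of block_size * len(x) floats.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Distances from this block of rows to every row from `start` on.
        d = cdist(x[start:stop], x[start:], metric=metric)
        # Columns begin at `start`, so the block's diagonal is the global
        # diagonal; k=1 keeps only the pairs (i, j) with i < j.
        total += np.triu(d, k=1).sum()
    return total / (n * (n - 1) / 2)


rng = np.random.default_rng(0)
x = rng.random((500, 2))
print(mean_pairwise_distance(x, block_size=64), pdist(x).mean())
```

Summing the distances and dividing once by the number of pairs sidesteps the per-row weighting entirely. A larger block_size trades memory for fewer cdist calls.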

python

numpy

scipy

scipy-spatial