Python 二维数组的地球移动器距离

Python Earth Mover Distance of 2D arrays

我想计算两个 2D 阵列(这些不是图像)之间的 Earth Mover 距离

现在我浏览两个库:scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html) and pyemd (https://pypi.org/project/pyemd/).

#define a sampeling method
def sampeling2D(n, mu1, std1, mu2, std2):
   #sample from N(0, 1) in the 2D hyperspace
   x = np.random.randn(n, 2)

   #scale N(0, 1) -> N(mu, std)
   x[:,0] = (x[:,0]*std1) + mu1
   x[:,1] = (x[:,1]*std2) + mu2

   return x

#generate two sets
Y1 = sampeling2D(1000, 0, 1, 0, 1)
Y2 = sampeling2D(1000, -1, 1, -1, 1)

#compute the distance
distance = pyemd.emd_samples(Y1, Y2)

虽然 scipy 版本不接受二维数组并且它 returns 一个错误,但是 pyemd方法returns一个值。如果你从文档中看到,它说它只接受一维数组,所以我认为输出是错误的。在这种情况下,我该如何计算这个距离?

所以如果我没理解错的话,你是在尝试传输采样分布,即计算所有集群权重为 1 的设置的距离。通常,你可以将 EMD 的计算视为一个实例最小成本流,在您的情况下,这归结为 linear assignment problem: Your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using scipy.spatial.distance.cdist, and in fact SciPy provides a solver for the linear sum assignment problem as well in scipy.optimize.linear_sum_assignment (which recently 看到了 SciPy 1.4 中可用的巨大性能改进。如果您 运行 遇到性能问题,您可能会感兴趣; 1.3 实现对于 1000x1000 输入有点慢)。

换句话说,你想做的事情归结为

from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

d = cdist(Y1, Y2)
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)

也可以使用 scipy.sparse.csgraph.min_weight_bipartite_full_matching 作为 linear_sum_assignment 的直接替代品;虽然它是为稀疏输入(你的肯定不是)而设计的,但它可能会在某些情况下提供性能改进。

验证此计算的结果是否与您从最小成本流解算器中获得的结果相匹配可能是有益的; NetworkX 中提供了一个这样的求解器,我们可以在其中手动构建图形:

import networkx as nx

G = nx.DiGraph()

# Represent elements in Y1 by 0, ..., 999, and elements in
# Y2 by 1000, ..., 1999.
for i in range(n):
    G.add_node(i, demand=-1)
    G.add_node(n + i, demand=1)

for i in range(n):
    for j in range(n):
        G.add_edge(i, n + j, capacity=1, weight=d[i, j])

至此,我们可以验证上述方法是否符合最小成本流:

In [16]: d[assignment].sum() == nx.algorithms.min_cost_flow_cost(G)
Out[16]: True

同样,对于一维输入,结果与 scipy.stats.wasserstein_distance 一致是有指导意义的:

from scipy.stats import wasserstein_distance

np.random.seed(0)
n = 100

Y1 = np.random.randn(n)
Y2 = np.random.randn(n) - 2
d =  np.abs(Y1 - Y2.reshape((n, 1)))

assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)       # 1.9777950447866477
print(wasserstein_distance(Y1, Y2))  # 1.977795044786648