Manually find the distance between centroid and labelled data points

Question

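I ran a cluster analysis on some data X and obtained the labels y and the centroids c. Now I am trying to calculate the distance between each point in X and the centroid c of its assigned cluster. This is easy when we have a small number of points: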

import numpy as np

# 10 random points in 3D space
X = np.random.rand(10,3)

# define the number of clusters, say 3
clusters = 3

# give each point a random label 
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)

# randomly assign location of centroids 
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)

# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i]-c[y[i][0]]))

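Unfortunately, the actual data has many more rows. Is there a way to vectorize this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.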

Answer 1

Thanks to numpy's array indexing, you can turn the for loop into a one-liner and avoid explicit looping altogether:

distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

This does the same thing as the original for loop.
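To see why the indexing works, note that y is a column vector of shape (10, 1), so c[y] has shape (10, 1, 3), and the einsum call only collapses that middle axis of length 1. The sketch below is an illustration rather than part of the original answer: it flattens y first, so the indexing alone yields one centroid per row, and then checks the result against both the loop and the einsum one-liner. The seeded generator and the names assigned, loop_distances, same_as_loop, and same_as_einsum are assumptions introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the sketch is reproducible
X = rng.random((10, 3))         # 10 points in 3D, as in the question
clusters = 3
y = rng.integers(0, clusters, size=10).reshape(-1, 1)  # column of labels, shape (10, 1)
c = rng.random((clusters, 3))   # one centroid per cluster

# Flattening y gives a (10,) index array, so c[y.ravel()] is the (10, 3)
# array holding the assigned centroid of every point.
assigned = c[y.ravel()]
distances = np.linalg.norm(X - assigned, axis=1)

# Reference: the explicit loop from the question.
loop_distances = np.array(
    [np.linalg.norm(X[i] - c[y[i][0]]) for i in range(len(X))]
)
same_as_loop = np.allclose(distances, loop_distances)

# Reference: the einsum one-liner from the answer.
einsum_distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
same_as_einsum = np.allclose(distances, einsum_distances)
```

If the labels come back as a flat array (which is what scikit-learn's KMeans returns in labels_), the ravel call is a no-op and c[y] can be used directly.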

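Edit: Thanks @Kris, I had forgotten the axis keyword. Since I did not specify it, numpy automatically computed the norm of the entire flattened matrix instead of only along the rows (axis 1). I have updated it now, and it should return an array of distances, one per point. Also, @Kris suggested using einsum for this specific application.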
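The effect of the axis keyword can be shown on a tiny hand-made array. This is a minimal sketch with made-up numbers, not data from the question; the names D, whole, and per_row are assumptions for illustration.

```python
import numpy as np

# Two made-up difference vectors (point minus centroid), one per row.
D = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 5.0]])

# Without axis, norm treats the 2-D array as a whole and returns a single
# scalar (the Frobenius norm, i.e. the 2-norm of the flattened array).
whole = np.linalg.norm(D)

# With axis=1, norm is taken along each row, giving one distance per point.
per_row = np.linalg.norm(D, axis=1)
```

Here whole is a single number, the square root of 50, while per_row holds one Euclidean length per row, 5 for each of the two rows.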

python

numpy

cluster-analysis

k-means