Find edge points of a numpy array for kmeans centroid initialization

Question

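I am implementing the kmeans algorithm in Python. I am testing new ways of initializing the centroids, and I want to implement one and see what effect it has on the clustering.

My idea is to select data points from my dataset so that the centroids are initialized at the edge points of the data.

A simple example with 2 attributes:

Suppose this is my input array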

input = np.array([[3,3], [1,1], [-1,-1], [3,-3], [-1,1], [-3,3], [1,-1], [-3,-3]])

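From this array I want to select the edge points, i.e. [3,3] [-3,-3] [-3,3] [3,-3]. So if my k is 4, these are the points that would be selected.

The datasets I am working with have 4 and 9 attributes and about 300 data points.

Note: I have not yet found a solution for the case where k differs from the number of edge points, but if k is greater than the number of edge points, I think I would select these 4 points and then try to place the remaining centroids around the center of the graph.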
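The "k greater than the number of edge points" idea could be sketched roughly as follows. This is only one possible reading of "place the rest around the center": it assumes the edge indices are already known, and it fills the remaining slots with the data points closest to the overall mean. The helper name `init_centroids` and the variable `data` (used instead of `input` to avoid shadowing the Python built-in) are hypothetical.

```python
import numpy as np

def init_centroids(data, edge_idx, k):
    """Hypothetical helper: edge points first, then points nearest the data mean."""
    edge_idx = np.asarray(edge_idx)
    n_extra = k - len(edge_idx)
    if n_extra < 0:
        # The case k < number of edge points is left open, as in the question.
        raise ValueError("k is smaller than the number of edge points")
    # Euclidean distance of every point to the overall mean of the data
    dist_to_mean = np.linalg.norm(data - data.mean(axis=0), axis=1)
    # Make sure the edge points themselves are not picked a second time
    dist_to_mean[edge_idx] = np.inf
    extra_idx = np.argsort(dist_to_mean, kind="stable")[:n_extra]
    return data[np.concatenate([edge_idx, extra_idx])]

data = np.array([[3, 3], [1, 1], [-1, -1], [3, -3],
                 [-1, 1], [-3, 3], [1, -1], [-3, -3]])
# Indices of [3,3], [3,-3], [-3,3], [-3,-3] in the example array
centroids = init_centroids(data, edge_idx=[0, 3, 5, 7], k=6)
print(centroids)
```

With k = 6, the first four rows are the edge points and the last two are inner points; which inner points are chosen depends on how ties in distance are broken.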

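I have also considered finding the maximum and minimum of each column and trying to locate the edges of my dataset from there, but I do not know an efficient way to identify the edges from those values.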
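A minimal sketch of the per-column min/max idea, under the assumption that an "edge point" is a row whose every coordinate equals either the minimum or the maximum of its column. That holds for the toy array, but in real data with 4 or 9 attributes very few rows (possibly none) will satisfy it, so treat this as an illustration rather than a general solution. The variable `data` stands in for `input`.

```python
import numpy as np

data = np.array([[3, 3], [1, 1], [-1, -1], [3, -3],
                 [-1, 1], [-3, 3], [1, -1], [-3, -3]])

# Per-column extremes
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# True where a coordinate sits on its column's minimum or maximum
on_extreme = (data == col_min) | (data == col_max)

# Keep rows that are extreme in every column ("corners" of the bounding box)
corner_idx = np.flatnonzero(on_extreme.all(axis=1))
print(corner_idx)        # [0 3 5 7]
print(data[corner_idx])
```

Relaxing `all` to `any` would instead keep every row that is extreme in at least one column, which selects more points and may suit higher-dimensional data better.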

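If you think this idea will not work, I would love to hear your opinion.

Questions

Does numpy have a function that returns the indices of the edge data points of my dataset?
If not, how would I go about finding these edge points in my dataset?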

Answer 1

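Use scipy's pairwise distances to find how far each point is from every other point:

import numpy as np
from scipy.spatial.distance import pdist, squareform
p = pdist(input)  # condensed vector of pairwise Euclidean distances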

Then use squareform to turn the vector p into a square distance matrix:

s = squareform(pdist(input))

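Then use numpy's argwhere to find the index pairs where the distance equals the maximum, and use those indices to look up the points in the input array:

input[np.argwhere(s == np.max(p))]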

array([[[ 3,  3],
        [-3, -3]],

       [[ 3, -3],
        [-3,  3]],

       [[-3,  3],
        [ 3, -3]],

       [[-3, -3],
        [ 3,  3]]])
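The result above lists each farthest pair twice (once per ordering), because the distance matrix is symmetric. If a flat list of distinct edge points is wanted, one option is to collapse the index pairs with `np.unique`. The sketch below does that. It uses `points` instead of `input` to avoid shadowing the built-in, and `np.isclose` instead of `==` as a precaution against floating-point round-off; both are choices made here, not part of the answer.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[3, 3], [1, 1], [-1, -1], [3, -3],
                   [-1, 1], [-3, 3], [1, -1], [-3, -3]])

p = pdist(points)        # condensed pairwise distances
s = squareform(p)        # full symmetric distance matrix

# Index pairs (i, j) whose distance equals the maximum distance
pairs = np.argwhere(np.isclose(s, p.max()))

# Flatten the pairs and drop duplicates to get distinct point indices
edge_idx = np.unique(pairs)
print(edge_idx)          # [0 3 5 7]
print(points[edge_idx])
```

Note that this only returns points that take part in a maximum-distance pair. In the toy array that is all four corners, but in real data it will often be just two points.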

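The full code:

import numpy as np
from scipy.spatial.distance import pdist, squareform
p = pdist(input)   # pairwise distances (condensed form)
s = squareform(p)  # square distance matrix
input[np.argwhere(s == np.max(p))]  # points in the farthest-apart pairs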
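As an alternative that is not part of the answer above, the edge points could also be read as the vertices of the convex hull of the data. The sketch below uses `scipy.spatial.ConvexHull`, whose `vertices` attribute holds the indices of the input points that form the hull. Whether this is a useful notion of "edge" for kmeans initialization is an assumption. In 9 dimensions with about 300 points, a large share of the points may end up on the hull and the computation can become expensive.

```python
import numpy as np
from scipy.spatial import ConvexHull

points = np.array([[3, 3], [1, 1], [-1, -1], [3, -3],
                   [-1, 1], [-3, 3], [1, -1], [-3, -3]])

hull = ConvexHull(points)

# hull.vertices contains indices into `points`; sort them for a stable order
hull_idx = np.sort(hull.vertices)
print(hull_idx)          # [0 3 5 7]
print(points[hull_idx])
```

If the hull has more than k vertices, some further selection step is still needed; the question leaves that case open.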
