Clustering data and finding minimum and maximum value of a cluster

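I have a text file containing a long two-dimensional array. The first element of each row is a number between 1 and 6.

I want to cluster the lines. Assuming two clusters for each first-element value in the range 1-6, how do I determine the minimum and maximum value of a cluster for this data (here the range is 0 to 6)?

Looking at the blue clusters, I want to determine the minimum and maximum of each cluster to use as the cluster's boundaries. Which algorithm can solve this? I need to find the min-max for all clusters across these 6 lines.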

You should use kmeans for clustering, plus some dictionary mapping to get the min/max values:

Code:

import numpy as np
from scipy.cluster.vq import kmeans, vq
from collections import defaultdict

dd = defaultdict(list)

arr = [[1, 2], [3,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [4,906], [2,5000], [2,358], [4,69], [3,89], [4, 258],[2, 632], [4, 585], [3, 47]]

for k in arr:
  dd[k[0]].append(k[1])  # group each row's second element under its first element (the key)

dd = dict(dd)
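
As a quick check of the grouping step above, here is a minimal sketch run on a few rows taken from arr. The names sample and grouped are only illustrative and are not part of the answer's code.

```python
from collections import defaultdict

# a few rows copied from arr, for illustration only
sample = [[1, 2], [3, 585], [2, 0], [1, 500], [3, 54]]

grouped = defaultdict(list)
for key, value in sample:
    # collect every second element under its first element
    grouped[key].append(value)
grouped = dict(grouped)

print(grouped)  # {1: [2, 500], 3: [585, 54], 2: [0]}
```

Each key ends up with the plain list of values that will be clustered on its own in the next step.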

Before trying to understand the code below, have a look here

"""
The code below builds a new dict from the data in the previous dict.
Each key maps to 2 lists, holding the min/max value of each cluster.
"""

new_dd = {}

for k, v in dd.items():
  obs = np.array(v, dtype=float)
  codebook, _ = kmeans(obs, 2)  # ask for 2 clusters; kmeans may return fewer centroids
  cluster_indices, _ = vq(obs, codebook)  # index of the nearest centroid for each value

  # split this key's values by assigned cluster
  zero_cluster = [v[i] for i, val in enumerate(cluster_indices) if val == 0]
  one_cluster = [v[i] for i, val in enumerate(cluster_indices) if val != 0]

  # an empty cluster keeps the placeholder [0, 0]
  min_zero, max_zero = (min(zero_cluster), max(zero_cluster)) if zero_cluster else (0, 0)
  min_one, max_one = (min(one_cluster), max(one_cluster)) if one_cluster else (0, 0)

  # store [min, max] of cluster one, then of cluster zero
  # (cluster labels are arbitrary, so the order of the two pairs can change between runs)
  new_dd[k] = [[min_one, max_one], [min_zero, max_zero]]

print(new_dd)
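
To see roughly what the kmeans/vq pair does for each key, here is a dependency-free sketch of two-cluster k-means (Lloyd iterations) on 1-D values. It is not SciPy's implementation: the helper name two_means_1d and the deterministic start from the smallest and largest value are assumptions made for illustration, so on less clearly separated data it may split differently from the randomly initialised kmeans above. It returns the [min, max] pairs sorted by lower bound, which sidesteps the arbitrary cluster labels.

```python
def two_means_1d(values, max_iter=100):
    """Split 1-D values into two clusters with plain Lloyd iterations.

    Hypothetical helper for illustration; not SciPy's implementation.
    Returns [min, max] pairs, one per cluster, sorted by lower bound.
    """
    pts = sorted(values)
    if not pts:
        return []
    if pts[0] == pts[-1]:
        return [[pts[0], pts[-1]]]  # all values identical: a single cluster

    # assumed deterministic start: smallest and largest value as centroids
    c_low, c_high = pts[0], pts[-1]
    for _ in range(max_iter):
        # assignment step: each point goes to the nearer centroid
        low = [x for x in pts if abs(x - c_low) <= abs(x - c_high)]
        high = [x for x in pts if abs(x - c_low) > abs(x - c_high)]
        # update step: move each centroid to the mean of its points
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high)
        if (new_low, new_high) == (c_low, c_high):
            break  # centroids stopped moving, assignments are stable
        c_low, c_high = new_low, new_high
    return [[min(low), max(low)], [min(high), max(high)]]


# usage on two of the value lists produced by the grouping step
groups = {1: [2, 500], 2: [0, 668, 5000, 358, 632]}
ranges = {k: two_means_1d(v) for k, v in groups.items()}
print(ranges)  # {1: [[2, 2], [500, 500]], 2: [[0, 668], [5000, 5000]]}
```

Note how the single outlier 5000 forms its own cluster for key 2; k-means minimises squared distance to the centroids, so an extreme value can claim a whole cluster.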