Creating a list of nearest centre items

Question

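I'm currently working on a project that uses k-means on a large dataset. I want to stretch my brain a little and do this without any external libraries, only by writing my own functions. I've come a long way, but I've hit a problem: the lists that should be built from where the cluster centres are located are not being created as intended.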

For convenience, I've created a small subset of the data below to work with, instead of using the entire dataset I have

dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), 
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]

cluster_1 = (0, 1)
cluster_2 = (1, 2)
clusters = [cluster_1, cluster_2] # although clusters not near data, it is to practise my model

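Below I have 3 functions that relate to the process of creating the cluster centre points

Calculate the distance between the data and the cluster centres, where each point in dataset is compared with cluster_list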

def calculate_distance(point1, point2):
    distance = 0
    for i in range(len(point1)):
        # Euclidean distance formula
        distance += (point1[i] - point2[i])**2
    # result then square rooted for distance
    return distance**0.5
    # end of function

Determine which cluster centre a point is nearest to

def find_nearest_centre(dataset1, clusters):
    nearest_point = []
    min_distance = 100000
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset1)
        if distance < min_distance:
            min_distance = distance
        nearest_point.append(min_distance)
        
    return nearest_point

Create two lists, one for each cluster, containing the coordinates of the data that belong to that cluster.

def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_centre = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_centre == clusters[0]:
            list_1.append(d)
        elif nearest_centre == clusters[1]:
            list_2.append(d)
        
    return list_1, list_2

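Now to my problem. When I run the function create_list, it only creates two empty lists rather than appending each coordinate as intended. Although unrealistic, if the first 3 values were in the first cluster and the last 3 values were closest to the second cluster, the desired output would be: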

create_list(dataset1, clusters) # this is only function needed to operate ideally

list_1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323)] # list of tuples output
list_2 = [(4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)] # list of tuples output

Any help would be much appreciated, obviously sticking to the theme of not using external packages. Thanks!

Answer 1

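You get empty lists because you are comparing a cluster with the value returned by find_nearest_centre, which is built from point distances rather than being a cluster, so there is no possible match.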
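To make the mismatch visible, here is a minimal sketch that reproduces what the question's function hands back for a single point. The name original_find_nearest_centre is only used here for illustration, and it assumes the loop was meant to use the function's own first argument.

```python
def calculate_distance(point1, point2):
    # Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def original_find_nearest_centre(point, clusters):
    # Mirrors the question's logic: it appends the running minimum
    # distance once per cluster, so the result is a list of floats.
    nearest_point = []
    min_distance = 100000
    for c in clusters:
        distance = calculate_distance(c, point)
        if distance < min_distance:
            min_distance = distance
        nearest_point.append(min_distance)
    return nearest_point


clusters = [(0, 1), (1, 2)]
result = original_find_nearest_centre((6.08804, 3.457729), clusters)

print(result)                 # a list holding two distances
print(result == clusters[0])  # False: a list of distances is never equal to a coordinate tuple
print(result == clusters[1])  # False
```

Because both comparisons are always False, neither branch in create_list ever runs, and both lists stay empty.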

Return the nearest cluster instead of the point from find_nearest_centre

def find_nearest_centre(dataset, clusters):
    min_distance = float("inf")
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
            nearest_cluster = c

    return nearest_cluster
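If you prefer something more compact, the same idea can be sketched with the built-in min and a key function. This is only an alternative spelling under the same assumptions (a non-empty list of clusters, points and centres of equal dimension), not a change in behaviour.

```python
def calculate_distance(point1, point2):
    # Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def find_nearest_centre(point, clusters):
    # min() with a key returns the cluster itself, not the distance.
    # It raises ValueError if clusters is empty.
    return min(clusters, key=lambda c: calculate_distance(c, point))


clusters = [(0, 1), (1, 2)]
print(find_nearest_centre((4.579573, 4.03559), clusters))  # (1, 2)
```

On ties, min keeps the first cluster it meets, which matches the strict < comparison in the loop version.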

Then compare cluster with cluster

def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_cluster = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_cluster == clusters[0]:
            list_1.append(d)
        elif nearest_cluster == clusters[1]:
            list_2.append(d)
        else:
            print("No match")

    return list_1, list_2
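The version above is tied to exactly two clusters. As a rough sketch of how it could be generalised, you could group by the index of the nearest centre instead of comparing tuples. The names find_nearest_index and create_lists are hypothetical helpers introduced here, not part of your code.

```python
def calculate_distance(point1, point2):
    # Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def find_nearest_index(point, clusters):
    # Position of the nearest centre in the clusters list
    distances = [calculate_distance(c, point) for c in clusters]
    return distances.index(min(distances))


def create_lists(dataset, clusters):
    # One empty list per cluster, filled by nearest-centre index
    groups = [[] for _ in clusters]
    for point in dataset:
        groups[find_nearest_index(point, clusters)].append(point)
    return groups


dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323),
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
clusters = [(0, 1), (1, 2)]

list_1, list_2 = create_lists(dataset1, clusters)
print(list_1)  # []
print(list_2)  # all six points
```

Working with indices also means the "No match" branch is never needed, and two centres with identical coordinates cannot be confused with each other.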

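The output is different from what you expected, but cluster_2 is always the closer centre for this data: every point has both coordinates above those of cluster_2 = (1, 2), which in turn lie above those of cluster_1 = (0, 1).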

list_1 = []
list_2 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
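A quick way to check which centre is closer for the sample data is to print both distances for each point. This is just a verification sketch using the data from the question; the list closer_to_2 is a name made up for the check.

```python
def calculate_distance(point1, point2):
    # Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323),
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
cluster_1 = (0, 1)
cluster_2 = (1, 2)

closer_to_2 = []
for point in dataset1:
    d1 = calculate_distance(point, cluster_1)
    d2 = calculate_distance(point, cluster_2)
    closer_to_2.append(d2 < d1)
    # point, distance to cluster_1, distance to cluster_2
    print(point, round(d1, 3), round(d2, 3))

print(all(closer_to_2))  # True
```

With centres placed nearer to the data, the points would start to split between the two lists.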

python

k-means