In Python hierarchical clustering from pairwise distances, how can I cut the tree at a specific distance and get the clusters along with the list of members of each cluster?

I have pairwise distance data like this:

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

So I have 4 objects, and I clustered them using the following lines of code:

keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

Z = linkage(distances)

labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendro = dendrogram(Z, labels=labels)

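This gives me a dendrogram. What code would give me the clusters and the names of the objects in each cluster if I cut the dendrogram at distance 2?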

You can use the scipy function cut_tree. Here is an example with your data:

from scipy.cluster.hierarchy import cut_tree, dendrogram, linkage

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

Z = linkage(distances)

labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendro = dendrogram(Z, labels=labels)

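# cut_tree returns one row per observation, in the same order as `labels`.
# dendro['ivl'] is in leaf (plot) order, which can differ, so use `labels`.
members = labels
clusters = cut_tree(Z, height=2)
cluster_ids = [c[0] for c in clusters]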

for k in range(max(cluster_ids) + 1):
    print(f"Cluster {k}")
    for i, c in enumerate(cluster_ids):
        if c == k:
            print(f"{members[i]}")

    print('\n')

Cutting the tree at height 2, the output is:

Cluster 0
DN10172_i1


Cluster 1
DN1357_i1


Cluster 2
DN1357_i2
DN1357_i5
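
If you would rather have the members collected in a data structure than printed, the same cut can be wrapped in a small function. This is only a sketch under a few assumptions: the helper name clusters_at_height is made up for illustration, the distance dictionary is assumed to contain every pair exactly once, and linkage is left at its default (single linkage), as in the code above. It skips the dendrogram call, so no plotting is involved.

```python
from collections import defaultdict
from itertools import combinations

from scipy.cluster.hierarchy import cut_tree, linkage


def clusters_at_height(pair_distances, height):
    """Hypothetical helper: map cluster id -> list of member names."""
    # Look up each pair regardless of the order its two names were written in.
    lookup = {frozenset(pair): d for pair, d in pair_distances.items()}
    labels = sorted({name for pair in pair_distances for name in pair})

    # Condensed distance vector in the order linkage expects:
    # (0,1), (0,2), ..., (1,2), ... over the sorted labels.
    condensed = [lookup[frozenset(p)] for p in combinations(labels, 2)]

    Z = linkage(condensed)  # default method is single linkage
    # cut_tree returns one row per observation, in the same order as labels.
    ids = cut_tree(Z, height=height).ravel()

    groups = defaultdict(list)
    for name, cluster_id in zip(labels, ids):
        groups[int(cluster_id)].append(name)
    return dict(groups)


obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

clusters = clusters_at_height(obj_distances, 2)
for cluster_id, names in clusters.items():
    print(cluster_id, names)
```

With this data the helper should produce the same three groups as the output above. Note that DN1357_i1 is at distance exactly 2.0 from DN1357_i5, yet the cut at height 2 keeps it in its own cluster, so pass a slightly larger height if you want links at exactly that distance to be merged.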