How can I list out all current clusters when using a single linkage algorithm?

Question

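I'm currently clustering in Python using from scipy.cluster.hierarchy import linkage. From the manual I know it returns its result in the form [A, B, length, #], where A and B are the indices of the elements that are going to be merged at this stage (?). But can I get any information about clusters that have already been merged but are not involved in this stage?

For example, my data set is: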

A=[[1,1],[1,2],[1,3],[1,4],[1,5],
   [10,1],[10,2],[10,3],[10,4],[10,5],
   [15,1],[15,2],[15,3],[15,4],[15,5],
   [30,1],[30,2],[30,3],[30,4],[30,5]]

and I apply the single linkage algorithm to it:

Z = linkage(A, 'single')

Z=[[  0.   4.   1.   2.]
   [  1.  20.   1.   3.]
   [  2.  21.   1.   4.]
   [  3.  22.   1.   5.]
   [ 17.  19.   1.   2.]
   [  5.   9.   1.   2.]
   [  6.  25.   1.   3.]
   [  7.  26.   1.   4.]
   [  8.  27.   1.   5.]
   [ 18.  24.   1.   3.]
   [ 10.  14.   1.   2.]
   [ 11.  30.   1.   3.]
   [ 12.  31.   1.   4.]
   [ 13.  32.   1.   5.]
   [ 16.  29.   1.   4.]
   [ 15.  34.   1.   5.]
   [ 28.  33.   5.  10.]
   [ 23.  36.   9.  15.]
   [ 35.  37.  15.  20.]]

Here I choose 5 as the distance limit for clustering, so I get

[ 28. 33. 5. 10.]

Then I trace 28 and 33 back to the original indices:

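cut = 5
temp1 = []
temp2 = [[], []]  # one list per child of the selected merge
# collect every merge whose distance is at least `cut`
for i in range(len(Z)):
    if Z[i][2] >= cut:
        temp1.append(Z[i])
# start from the two cluster ids joined by the first such merge
for i in range(2):
    temp2[i].append(int(temp1[0][i]))
# replace each cluster id (>= len(A)) by its two children
# until only original point indices remain
for j in range(0, len(temp2)):
    try:
        g = max(temp2[j])
    except ValueError:
        continue
    G = int(g - len(A))  # row of Z that created cluster g
    while g >= len(A):
        ind = temp2[j].index(g)
        temp2[j].append(int(Z[G][0]))
        temp2[j].append(int(Z[G][1]))
        del temp2[j][ind]
        g = max(temp2[j])
        G = int(g - len(A))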

and find

temp2 = [[8, 7, 6, 5, 9], [13, 12, 11, 10, 14]]

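This means '28' represents the points [10,1],[10,2],[10,3],[10,4],[10,5] and '33' represents the points [15,1],[15,2],[15,3],[15,4],[15,5], so the cluster made of [10,x] and the cluster made of [15,x] are the ones merged at this stage.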
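The same expansion of a cluster id into its original points can also be sketched with scipy's to_tree, which builds one ClusterNode per id. This is only an illustration, assuming the data set above; the variable names (nodes, row, members) are made up for the example, and the two children are looked up from Z rather than hard-coded as 28 and 33 because the ids can depend on how ties are broken.

```python
from scipy.cluster.hierarchy import linkage, to_tree

# Same 20 points as in the question, in the same order.
A = [[x, y] for x in (1, 10, 15, 30) for y in range(1, 6)]
Z = linkage(A, "single")

# nodes[i] is the ClusterNode with id i: ids below len(A) are single points,
# id len(A) + k is the cluster created by row k of Z.
_root, nodes = to_tree(Z, rd=True)

cut = 5
# First merge whose distance reaches the cut.
row = next(r for r in Z if r[2] >= cut)
left, right = int(row[0]), int(row[1])

# pre_order() lists the ids of the original points under a node.
members = [sorted(nodes[c].pre_order()) for c in (left, right)]
print(members)
```

This only recovers the two clusters taking part in the merge, which is the same limitation the question describes.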

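But obviously [1,1],[1,2],[1,3],[1,4],[1,5] and [30,1],[30,2],[30,3],[30,4],[30,5] must have formed two other clusters at earlier stages, so at the moment just before [10,x] and [15,x] are merged there are currently 4 clusters.

So the result I want is something like this: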

temp2 = [[8, 7, 6, 5, 9], [13, 12, 11, 10, 14], [0, 1, 2, 3, 4], [15, 16, 17, 18, 19]]

How do I get the last two clusters? T^T

Answer 1

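As described in the documentation, linkage gives you the distance between clusters, which is the same as the cophenetic distance between elements in those clusters. As described in other documentation, fcluster will give you flat clusters, and if you specify 'distance' as the criterion, it will cut the dendrogram based on cophenetic distance.

So you can get what you want by using fcluster to threshold the clusters at the distance you chose. One small wrinkle, however, is that fcluster treats the threshold as the highest distance at which clusters are still lumped together, not the lowest distance at which they are split. If you use 5 as the threshold, it will join the two clusters you refer to and give you only three clusters. You have to choose a threshold slightly less than 5 to get the result you want. For example: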

>>> from scipy.cluster import hierarchy as clust
>>> clust.fcluster(Z, 4.99999, criterion='distance')
array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1])

This tells you which cluster each item is in. To convert it back into a list of the indices in each cluster, you can use np.where:

>>> import numpy as np
>>> clusters = clust.fcluster(Z, 4.99999, criterion='distance')
>>> [np.where(clusters==k)[0].tolist() for k in np.unique(clusters)]
[[15, 16, 17, 18, 19],
 [0, 1, 2, 3, 4],
 [5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14]]
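To see the threshold behaviour side by side, here is a small self-contained sketch that cuts the same tree at exactly 5 and just below 5. The helper name flat_clusters is made up for this example, and using np.nextafter to get the largest float below 5 is just one way to pick a "slightly smaller" threshold instead of typing 4.99999 by hand.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Same 20 points as in the question, in the same order.
A = [[x, y] for x in (1, 10, 15, 30) for y in range(1, 6)]
Z = linkage(A, "single")


def flat_clusters(Z, t):
    """Group original indices by the label fcluster assigns at threshold t."""
    labels = fcluster(Z, t, criterion="distance")
    return [np.where(labels == k)[0].tolist() for k in np.unique(labels)]


# Merges at distance <= 5 are applied, so the [10,x] and [15,x] groups join.
at_cut = flat_clusters(Z, 5.0)
# Largest float strictly below 5: the distance-5 merge is not applied.
below_cut = flat_clusters(Z, np.nextafter(5.0, 0.0))

print(len(at_cut), len(below_cut))
print(sorted(below_cut))
```

The cluster labels themselves carry no meaning, so the sketch sorts the lists before printing rather than relying on label order.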

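So in summary, the idea is to look at what you call the "distance limitation" and use fcluster with that distance (or rather, a slightly smaller distance) as the threshold to get the flat clusters. This gives you a cluster number for each index, and you can then use np.where to get a list for each cluster.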
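If you would rather not pick a slightly smaller threshold, a rough alternative (not part of fcluster, just a sketch) is to replay the rows of Z yourself and stop at the first merge whose distance reaches the cut. The function name clusters_before_cut is hypothetical, the Z below is copied from the question, and the approach assumes the rows of Z are in non-decreasing distance order, as they are here.

```python
# Linkage rows copied from the question: (child a, child b, distance, size).
Z = [
    (0, 4, 1, 2), (1, 20, 1, 3), (2, 21, 1, 4), (3, 22, 1, 5),
    (17, 19, 1, 2), (5, 9, 1, 2), (6, 25, 1, 3), (7, 26, 1, 4),
    (8, 27, 1, 5), (18, 24, 1, 3), (10, 14, 1, 2), (11, 30, 1, 3),
    (12, 31, 1, 4), (13, 32, 1, 5), (16, 29, 1, 4), (15, 34, 1, 5),
    (28, 33, 5, 10), (23, 36, 9, 15), (35, 37, 15, 20),
]


def clusters_before_cut(Z, n, cut):
    """Return the clusters that exist just before the first merge with distance >= cut.

    n is the number of original points. Assumes rows of Z are ordered by
    non-decreasing distance.
    """
    # Every original point starts as its own cluster, keyed by its id.
    members = {i: [i] for i in range(n)}
    for step, (a, b, dist, _size) in enumerate(Z):
        if dist >= cut:
            break
        # Row `step` creates cluster id n + step from its two children.
        members[n + step] = members.pop(int(a)) + members.pop(int(b))
    return sorted(sorted(m) for m in members.values())


current = clusters_before_cut(Z, n=20, cut=5)
print(current)
```

With cut=5 this stops right before the [10,x] and [15,x] groups are joined, which is the moment asked about in the question.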
