Rclusterpp.hclust not providing correct clusters when using cutree

Question

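I have a fairly large data set with around 75,000 observations and 7 columns containing alert data details, which stats::hclust cannot handle (it crashes RStudio). After some searching, I found that Rclusterpp.hclust reportedly reduces the complexity and resource usage of hierarchical clustering, so I gave it a try. It takes around 5 minutes and does produce a dendrogram, but if I try to use cutree and specify either a height or a number of clusters, I get strange results. I see the same issue with a small sample of 38 observations, shown below. Am I doing something wrong, or is there a problem with the Rclusterpp.hclust package? (Running R 3.4.1.)

The sample data set looks like this: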

dataset
#   DAY COUNT LOCATION M1 M2 HOURS SOURCE
#1  238     2   222307  1  1  5437   1008
#2  238     1   222307  2  1  5437   1008
#3  238     5   222307  3  2  5437   1008
#4  238     2   222307  4  3  5437   1008
#5  238    14   222307  5  1  5437   1008
#6  238     4   222307  5  1  5437   1008
#7  238    14   222307  6  2  5437   1008
#8  238     3   222307  1  1  5437   1008
#9  238     1   222307  2  1  5437   1008
#10 238     1   222307  4  3  5437   1008
#11 238     2   222307  4  3  5437   1008
#12 238     2   222307  4  3  5437   1008
#13 238     5   222307  5  1  5437   1008
#14 238    11   222307  5  1  5437   1008
#15 238     1   222307  5  1  5437   1008
#16 238     3   222307  5  1  5437   1008
#17 238    18   222307  6  2  5437   1008
#18 238     2   222307  7  4  5437      9
#19 238     2   222307  8  4  5437     10
#20 238     3   222307  9  5  5437   1008
#21 238     2   222307 10  6  5437    865
#22 238     9   222307 11  7  5437     10
#23 238     2   222307 12  7  5437     10
#24 238     1   222307 12  7  5437     10
#25 238     5   222307 11  7  5437     10
#26 238     2   222307  8  4  5437     10
#27 238     3   222307 13  8  5437    864
#28 238     3   222307 14  8  5437    864
#29 238     1   222307 11  7  5437     10
#30 238     3   222307 11  7  5437     10
#31 238     2   222307 15  7  5437     10
#32 238     5   222307 11  7  5437     10
#33 238     2   222307 16  7  5437     10
#34 238     2   222307 17  7  5437     10
#35 238     3   222307 18  7  5437     10
#36 238     2   222307 15  7  5437     10
#37 238     6   222307 11  7  5437     10
#38 238     3   222307 19  7  5437     10

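DAY, HOURS and COUNT are real-valued, while LOCATION, M1, M2 and SOURCE are numerically encoded categorical values.

Using stats::hclust I get a clustering that represents the data well and, as expected, separates the 2 main clusters of alert events across all observations in this sample (i.e. the observation numbers grouped together in the dendrogram are alerts that should be grouped together):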

d1 <- dist((as.matrix(scale(dataset))))
hc1 <- hclust(d1, method = "single")
cutree(hc1,2)
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
# 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  1  1  2  2  2  2  2  1  1  2  2  2  2  2  2  2  2  2  2
plot(hc1)

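However, if I do the same thing with Rclusterpp.hclust, I get more clusters than I asked for (in this small sample, I get 3 when I ask for 2). When I run this on my large data set, I get nearly 20,000 clusters when I asked for only a few.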

library(Rclusterpp)  # provides Rclusterpp.hclust
d2 <- dist((as.matrix(scale(dataset))))
hc2 <- Rclusterpp.hclust(d2, method = "single")
cutree(hc2,2)
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
# 1  1  1  1  1  1  1  1  1  1  2  2  1  1  1  1  1  3  3  1  1  3  3  3  3  3  1  1  3  3  3  3  3  3  3  3  3  3
plot(hc2)

Any idea why this happens? Thanks.

Answer 1

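I looked into this a bit, and it seems that the return value of Rclusterpp.hclust is not fully aligned with that of stats::hclust, specifically with respect to the merge matrix.

According to the documentation of hclust, the merge component of the returned list is: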

an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.

For the C implementation of cutree, the word in parentheses (earlier) appears to matter.

Looking at head(hc2$merge), we see the following:

     [,1] [,2]
[1,]   -2   -9
[2,]  -25  -32
[3,]  -31  -36
[4,]  -19  -26
[5,]   -4    6
[6,]  -11  -12

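So in the fifth row there is a "pointer" to the sixth step, i.e. it points forward to a later step, which is the unexpected direction.

If we rearrange the merge component (swapping rows and "pointers"), things look fine: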
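A quick way to spot such forward "pointers" in any merge matrix is sketched below. The helper name find_forward_refs is hypothetical (it is not part of stats or Rclusterpp), and the toy input is just the six rows printed above, so a reference to a step beyond row 6 would also be flagged in this truncated view.

```r
# Hypothetical helper: return the rows of an hclust-style merge matrix
# that refer to a step which has not happened yet (step i or later).
find_forward_refs <- function(merge) {
  step <- row(merge)                  # step[i, j] == i
  bad  <- merge > 0 & merge >= step   # positive entry pointing at row i or later
  which(rowSums(bad) > 0)
}

# The first six rows of hc2$merge shown above, used as a toy input
m <- rbind(c(-2, -9), c(-25, -32), c(-31, -36),
           c(-19, -26), c(-4, 6), c(-11, -12))
find_forward_refs(m)
```

On the real object this would be called as find_forward_refs(hc2$merge); an empty result would suggest the matrix already follows the convention described in the hclust documentation.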

# non-generic replacements for specific data example
hc3 <- hc2
hc3$merge[5, ] <- c(-11,-12)
hc3$merge[6, ] <- c(-4,5)
hc3$merge[13, ] <- c(-10,6)
cutree(hc3, 2)

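You could write a function that handles this restructuring of the merge matrix, so that things always work as intended (perhaps as a wrapper around cutree).

Finally, note that there is an issue about this on GitHub, where you can find some discussion and a cross-package comparison: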
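One possible shape for such a function is sketched here. The name reorder_merge is hypothetical, and the sketch assumes that the only problem is the order of the steps, that the tree has no cycles, and that heights are monotone (as with single linkage). It repeatedly emits a step whose referenced clusters have already been emitted, preferring the lowest height, and then renumbers the positive entries. Each pass scans all steps, so the cost is quadratic in the number of observations; treat it as an illustration rather than something tuned for 75,000 rows. The demonstration uses only stats::hclust and deliberately scrambles a valid tree, since the behaviour of Rclusterpp itself is not reproduced here.

```r
# Hypothetical helper: reorder the steps of an hclust-like object so that
# every positive entry of $merge refers to an earlier row.
reorder_merge <- function(hc) {
  m <- hc$merge
  n_steps <- nrow(m)
  done    <- logical(n_steps)   # old steps already emitted
  new_pos <- integer(n_steps)   # new_pos[old step] = new step number
  ord     <- integer(n_steps)   # ord[new step] = old step number
  left <- m[, 1]
  right <- m[, 2]
  for (k in seq_len(n_steps)) {
    # A step is ready once every cluster it references has been emitted
    ok_left  <- left  < 0 | done[pmax(left, 1)]
    ok_right <- right < 0 | done[pmax(right, 1)]
    cand <- which(!done & ok_left & ok_right)
    if (length(cand) == 0) stop("merge matrix contains a cycle")
    # Prefer the lowest height, then the lowest original step number
    pick <- cand[order(hc$height[cand], cand)[1]]
    ord[k] <- pick
    new_pos[pick] <- k
    done[pick] <- TRUE
  }
  new_m <- m[ord, , drop = FALSE]
  pos <- new_m > 0
  new_m[pos] <- new_pos[new_m[pos]]   # renumber the "pointers"
  storage.mode(new_m) <- "integer"
  hc$merge  <- new_m
  hc$height <- hc$height[ord]
  hc
}

# Toy demonstration: take a valid tree and swap steps 1 and 2,
# which creates a forward "pointer" in row 1.
hc_ok <- hclust(dist(c(1, 2, 4, 7, 11, 16)), method = "single")
swap <- c(2, 1, 3, 4, 5)              # its own inverse permutation
hc_bad <- hc_ok
hc_bad$merge  <- hc_ok$merge[swap, ]
hc_bad$height <- hc_ok$height[swap]
pos <- hc_bad$merge > 0
hc_bad$merge[pos] <- swap[hc_bad$merge[pos]]

hc_fixed <- reorder_merge(hc_bad)
all(cutree(hc_fixed, 2) == cutree(hc_ok, 2))
```

Under the same assumptions, the question's object would be handled with cutree(reorder_merge(hc2), 2); whether that resolves every case produced by Rclusterpp.hclust has not been verified here.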
https://github.com/nolanlab/Rclusterpp/issues/4
