Creating dendrograms manually: how to fix "'merge' matrix has invalid contents" in plot.hclust?

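I am creating an hclust object manually (that is, I build a list with the required components and then set its class to hclust). The merge pattern, the fork heights, the ordering of the leaf nodes, and the leaf labels are all known. My goal (and my test) is to plot the resulting dendrogram. So far I cannot create a plottable hclust object from my parameters.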

The components of an hclust object are described in the documentation of the hclust function.

Below is a reproducible block of the R code I use to build and plot the dendrogram.

tree <- list()
tree$merge <- matrix(c( -1,  -7,  # row  1
                        -2,  -6,  # row  2
                        -3, -12,  # row  3
                        -4, -14,  # row  4
                        -5,  -8,  # row  5
                        -9, -11,  # row  6
                       -13, -20,  # row  7
                       -15, -19,  # row  8
                         1,   8,  # row  9
                         2,   5,  # row 10
                         3,   6,  # row 11
                         2, -18,  # row 12
                         1,   3,  # row 13
                         2,   4,  # row 14
                       -10,   7,  # row 15
                       -16, -17,  # row 16
                         1,   2,  # row 17
                        15,  16,  # row 18
                         1,  15), # row 19
                     ncol = 2,
                     byrow = TRUE)
tree$height <- c(0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.11167131, 0.11167131, 0.11167131, 0.12832304, 0.17304035, 0.17304035, 0.17304035, 0.17304035, 0.22965349, 0.22965349, 0.23334799)
tree$labels <- as.character(1:20)
tree$order <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)
class(tree) <- "hclust"
plot(tree)

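Each row of the tree$merge matrix corresponds to one fork. Negative integers are the indices of leaf nodes, while positive integers refer to clusters that already exist, identified by their row index in tree$merge.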

Running the code produces the following error message.

Error in plot.hclust(tree) : 'merge' matrix has invalid contents

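A sketch of the expected result is shown in the figure below, with the height values marked by additional dashed lines. (Not drawn to scale.)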

The validity of an hclust tree is checked by the .validity.hclust function. See lines 121-135 of its source code.

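The error means that your tree is invalid because of its merge matrix: it contains non-unique elements (for example 1 and 2). In a correctly constructed merge matrix all entries are unique and run from -N_obs to N_obs-2 (zero excluded), where N_obs is the (positive) number of observations. This is checked by the following if test in the code: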

if(identical(sort(as.integer(merge)), c(-(n:1L), +seq_len(n-2L))))
    TRUE
else
    "'merge' matrix has invalid contents"

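To see which entries break that test, the same comparison can be wrapped in a small diagnostic. This is only a sketch: check_merge is a hypothetical helper, not part of R, and the 4-observation matrix is made up for illustration.

```r
# Hypothetical helper (not part of R): repeats the comparison used by the
# validity test and reports which entries are duplicated or missing.
check_merge <- function(merge) {
  n <- nrow(merge) + 1L                      # number of observations
  expected <- c(-(n:1L), seq_len(n - 2L))    # -n..-1 and 1..(n-2)
  actual <- as.integer(merge)
  list(valid      = identical(sort(actual), expected),
       duplicated = sort(unique(actual[duplicated(actual)])),
       missing    = setdiff(expected, actual))
}

# Made-up example with 4 observations: row 3 refers to cluster 1 twice
# and never uses cluster 2.
bad <- matrix(c(-1L, -2L,
                -3L, -4L,
                 1L,  1L), ncol = 2, byrow = TRUE)
check_merge(bad)
# valid is FALSE; 1 is reported as duplicated and 2 as missing
```

Calling check_merge(tree$merge) on the matrix from the question should point at the repeated positive entries in the same way.
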
From the hclust reference:

merge an n − 1 by 2 matrix.

Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation − j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.

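All negative entries are singletons (observations); positive entries are merges with existing clusters and refer to the merge step of the algorithm at which that cluster was formed.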
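
To make the meaning of the positive entries concrete, here is a minimal sketch that expands every row of a merge matrix into the observations it contains. cluster_members is a hypothetical helper written for this illustration, and the 4-observation matrix is made up.

```r
# Hypothetical helper: expand each row of a merge matrix into its leaf set.
cluster_members <- function(merge) {
  members <- vector("list", nrow(merge))
  for (i in seq_len(nrow(merge))) {
    leaves <- integer(0)
    for (j in merge[i, ]) {
      if (j < 0) {
        leaves <- c(leaves, -j)             # a single observation
      } else {
        leaves <- c(leaves, members[[j]])   # cluster formed at earlier row j
      }
    }
    members[[i]] <- as.integer(leaves)
  }
  members
}

# Made-up example with 4 observations
m <- matrix(c(-1, -3,    # row 1: observations 1 and 3
              -2, -4,    # row 2: observations 2 and 4
               1,  2),   # row 3: cluster from row 1 + cluster from row 2
            ncol = 2, byrow = TRUE)
cluster_members(m)
# row 1 -> 1 3, row 2 -> 2 4, row 3 -> 1 3 2 4
```

A positive entry is therefore always a row number, never an observation label or an arbitrary cluster id.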

So, modify your hclust object. Here is some code that shows what a proper hclust object looks like:

iris2 <- iris[1:20,-5]
species_labels <- iris[,5]
d_iris <- dist(iris2)
tree_iris <- hclust(d_iris, method = "complete")

Take a close look at tree_iris$merge.
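
One possible way to inspect it, as a sketch that rebuilds the same tree so it runs on its own (the variable names valid_entries and refers_back are mine):

```r
# Rebuild the example tree from the first 20 rows of iris
d_iris <- dist(iris[1:20, -5])
tree_iris <- hclust(d_iris, method = "complete")

head(tree_iris$merge)                    # first few merge steps

# Every entry appears exactly once: -n..-1 and 1..(n-2)
n <- nrow(tree_iris$merge) + 1L
valid_entries <- identical(sort(as.integer(tree_iris$merge)),
                           c(-(n:1L), seq_len(n - 2L)))

# Every positive entry is smaller than its own row number,
# i.e. it refers to a cluster formed at an earlier step
refers_back <- all(tree_iris$merge < row(tree_iris$merge))

c(valid_entries = valid_entries, refers_back = refers_back)
```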

Update

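Once I had more time, I decided to fix your code. I modified the tree$merge entries so that each positive value refers to the row in which the cluster was formed. This is what working code that reproduces the dendrogram looks like: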

tree <- list()
tree$merge <- matrix(c( -1,  -7,  # row  1
                        -2,  -6,  # row  2
                        -3, -12,  # row  3
                        -4, -14,  # row  4
                        -5,  -8,  # row  5
                        -9, -11,  # row  6
                       -13, -20,  # row  7
                       -15, -19,  # row  8
                         1,   8,  # row  9: 1,7,15,19
                         2,   5,  # row 10: 2,6,5,8
                         3,   6,  # row 11: 3,12,9,11
                        10, -18,  # row 12: 2,6,5,8 + 18
                         9,  11,  # row 13: 1,7,15,19 + 3,12,9,11
                        12,   4,  # row 14: row 12 + row 4
                       -10,   7,  # row 15: row 7 + 10
                       -16, -17,  # row 16
                        13,  14,  # row 17: row 13 + row 14
                        15,  16,  # row 18: row 15 + row 16
                        17,  18), # row 19: row 17 + row 18
                     ncol = 2,
                     byrow = TRUE)
tree$height <- c(0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.11167131, 0.11167131, 0.11167131, 0.12832304, 0.17304035, 0.17304035, 0.17304035, 0.17304035, 0.22965349, 0.22965349, 0.23334799)
tree$labels <- as.character(1:20)
tree$order <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)
class(tree) <- "hclust"
plot(tree)
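
As an optional sanity check on the order component: the hclust documentation describes order as a permutation suitable for plotting, meaning the branches do not cross. One way to test that, sketched here under my own assumption that "no crossings" amounts to every cluster occupying a contiguous run of positions in order, is the following. The variable names are mine.

```r
# Corrected merge matrix and leaf order, copied from the code above
merge <- matrix(c(-1, -7, -2, -6, -3, -12, -4, -14, -5, -8, -9, -11,
                  -13, -20, -15, -19, 1, 8, 2, 5, 3, 6, 10, -18,
                  9, 11, 12, 4, -10, 7, -16, -17, 13, 14, 15, 16, 17, 18),
                ncol = 2, byrow = TRUE)
ord <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)

members <- vector("list", nrow(merge))   # leaves contained in each row's cluster
contiguous <- logical(nrow(merge))
for (i in seq_len(nrow(merge))) {
  leaves <- numeric(0)
  for (j in merge[i, ]) {
    leaves <- c(leaves, if (j < 0) -j else members[[j]])
  }
  members[[i]] <- leaves
  pos <- match(leaves, ord)              # plotting positions of these leaves
  contiguous[i] <- diff(range(pos)) == length(pos) - 1
}
all(contiguous)
# TRUE for this tree
```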