Cluster analysis in R (hclust): how to determine which variable is driving the clusters

Question

I am using hclust to run a cluster analysis on plant species cover data from sampling sites.

My study recorded the cover of 55 species at 100 sites. Plant cover at each site was recorded as a cover class from 0 to 4, where 0 is absent, "1" is 1-25% cover ... "4" is 76-100% cover.

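I am using Euclidean distance to measure differences in species cover between sites, and I would like to know which plant species are driving the grouping at each branch of the dendrogram. See the example df and code below; each row represents one site.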

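In the simplified example, I can see that sp1 is driving the association of sites 3 and 4. In my much larger dataset, how can I determine which species is/are driving the associations at the different levels of my dendrogram?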

Please let me know if I can clarify anything. Thanks for your help!

library(tidyverse)

site <- c(1:10)
sp1 <- c(0,1,4,4,3,3,2,1,0,2)
sp2 <- c(4,3,0,0,2,2,3,2,1,3)
sp3 <- c(3,2,1,1,2,2,3,2,1,3)
sp4 <- c(2,4,1,0,1,2,3,4,3,1)
df <- data.frame(site, sp1, sp2, sp3, sp4)

species <- select(df, sp1:sp4)

dend <- species %>% 
  dist(method = "euclidean") %>% 
  hclust(method = "ward.D") %>% 
  as.dendrogram()

plot(dend, ylab = "Euclidean Distance")

Answer 1

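Follow-up: I ended up assigning the sites in each cluster to arbitrary association groups and then running an indicator species analysis on those groups with the multipatt function from the indicspecies package. This allowed me to identify the species that significantly drive the clustering of the different groups.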

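library(indicspecies)  # provides multipatt()
library(permute)       # provides how(), used for the permutation settings

# Assign each site to an association group read off the dendrogram
clusters <- df %>% mutate(Association = 
                  case_when(site %in% c(3, 4) ~ 1, 
                            site %in% c(2, 8, 9) ~ 2, 
                            site %in% c(1, 5, 6, 7, 10) ~ 3))

abundance <- clusters[2:5]            # species columns sp1:sp4
association <- clusters$Association   # group membership of each site

# Point-biserial correlation (group-equalized) with 9999 permutations
indicator_r.g <- multipatt(abundance, association, func = "r.g", control = how(nperm = 9999))
summary(indicator_r.g)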


Multilevel pattern analysis
 ---------------------------

 Association function: r.g
 Significance level (alpha): 0.05

 Total number of species: 4
 Selected number of species: 4 
 Number of species associated to 1 group: 3 
 Number of species associated to 2 groups: 1 

 List of species associated to each combination: 

 Group 1  #sps.  1 
    stat p.value  
sp1 0.82  0.0193 *

 Group 2  #sps.  1 
     stat p.value  
sp4 0.832  0.0161 *

 Group 3  #sps.  1 
     stat p.value  
sp3 0.781  0.0317 *

 Group 2+3  #sps.  1 
     stat p.value  
sp2 0.844  0.0293 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
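
With a much larger dataset, assigning sites to groups by hand becomes impractical. As a rough sketch (not part of my original analysis), the groups could instead be cut from the tree with cutree(), followed by a look at the mean cover class of each species per group. The choice k = 3 is an assumption made only to mirror the three association groups above, and the group labels cutree() returns are arbitrary, so they may not match the hand-assigned codes.

```r
# Rebuild the example data from the question (base R only).
species <- data.frame(
  sp1 = c(0, 1, 4, 4, 3, 3, 2, 1, 0, 2),
  sp2 = c(4, 3, 0, 0, 2, 2, 3, 2, 1, 3),
  sp3 = c(3, 2, 1, 1, 2, 2, 3, 2, 1, 3),
  sp4 = c(2, 4, 1, 0, 1, 2, 3, 4, 3, 1)
)

# Same distance measure and linkage method as in the question.
hc <- hclust(dist(species, method = "euclidean"), method = "ward.D")

# Assumption: cut the tree into k = 3 groups. Other values of k (or a
# height via the h argument) can be used to inspect other tree levels.
k <- 3
groups <- cutree(hc, k = k)

# Mean cover class of each species within each group. This is only a
# descriptive check of which species differ most between groups.
group_means <- aggregate(species, by = list(group = groups), FUN = mean)
print(group_means)
```

The resulting groups vector can be passed to multipatt() as the cluster argument in place of the hand-built association vector, and repeating the cut with different values of k gives one indicator analysis per level of the dendrogram.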

