Naming clusters in R

Question

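I am working with the iris dataset in R. I clustered the data with k-means; the output is stored in the variable km.out. However, I cannot find a simple way to assign the cluster numbers (1-3) to a species (versicolor, setosa, virginica). I came up with a manual approach, but it requires setting the seed and is very hands-on. There must be a better way to do this. Any ideas?

Here is what I did manually: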

for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 1) {
    km.out$cluster[i] = "versicolor"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 2) {
    km.out$cluster[i] = "setosa"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 3) {
    km.out$cluster[i] = "virginica"
  }
}

Answer 1

You can recode the cluster numbers and add them back to the original data:

library(dplyr)
mutate(iris, 
       cluster = case_when(km.out$cluster == 1 ~ "versicolor",
                           km.out$cluster == 2 ~ "setosa",
                           km.out$cluster == 3 ~ "virginica"))

Alternatively, you can recode the vector with elucidate::translate(), which takes a vector translation approach:

remotes::install_github("bcgov/elucidate") #if elucidate isn't installed yet
library(dplyr)
library(elucidate)

mutate(iris, 
       cluster = translate(km.out$cluster, 
                           old = c(1:3), 
                           new =  c("versicolor", 
                                    "setosa", 
                                    "virginica")))

Answer 2

R is a vectorized language, so the following single line is equivalent to the code in the question:

km.out$cluster <- c("versicolor", "setosa", "virginica")[km.out$cluster]
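
As a related sketch, the same lookup can be stored in a separate variable so that the numeric cluster assignments are not overwritten, and factor() gives an equivalent recoding. The kmeans() call, the seed, and the label order below are assumptions for illustration only; the label order is a placeholder that should be checked against a cross-tabulation, since cluster numbering is arbitrary.

```r
# Assumption: this kmeans() call stands in for the asker's km.out.
set.seed(1)
km.out <- kmeans(iris[, 1:4], centers = 3)

# Hypothetical label order; confirm it with
# table(km.out$cluster, iris$Species) before relying on it.
labels <- c("versicolor", "setosa", "virginica")

# Index the label vector by cluster number (the same idea as the
# one-liner above), keeping km.out$cluster itself unchanged.
cluster_name <- labels[km.out$cluster]

# Equivalent recoding as a factor with explicit levels and labels.
cluster_factor <- factor(km.out$cluster, levels = 1:3, labels = labels)

# Both approaches yield the same label for every observation.
print(all(cluster_name == as.character(cluster_factor)))
```

Keeping the original integer codes around makes it possible to re-map the labels later if the cross-tabulation shows a different correspondence.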

Answer 3

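It is not clear what you are trying to accomplish. The clusters created by kmeans will not match Species exactly, and there is no guarantee that clusters 1, 2, 3 will follow the order of the species in iris. Also, as you noted, the results will vary depending on the value of the seed. For example,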

set.seed(42)
iris.km <- kmeans(scale(iris[, -5]), 3)
table(iris.km$cluster, iris$Species)
#    
#     setosa versicolor virginica
#   1     50          0         0
#   2      0         39        14
#   3      0         11        36

Cluster 1 corresponds exactly to setosa, but cluster 2 mixes versicolor and virginica, as does cluster 3.
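
One way to see that the cluster numbers themselves are arbitrary is to cross-tabulate the assignments from two runs that differ only in the seed. This is a minimal sketch; the seeds and nstart = 25 are arbitrary choices that do not come from the answer above.

```r
# Two k-means runs on the same scaled data, differing only in the seed.
x <- scale(iris[, -5])

set.seed(1)
km_a <- kmeans(x, centers = 3, nstart = 25)
set.seed(2)
km_b <- kmeans(x, centers = 3, nstart = 25)

# Rows: cluster numbers from run A; columns: cluster numbers from run B.
# If both runs find the same partition, each row has a single non-zero
# cell, but that cell need not lie on the diagonal.
seed_tab <- table(run_a = km_a$cluster, run_b = km_b$cluster)
print(seed_tab)
```

Any non-zero counts off the diagonal mean that the same group of flowers received different numbers in the two runs, or that the partitions themselves differ.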

Answer 4

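If you want to assign the cluster numbers (1-3) to a species (versicolor, setosa, virginica), you may not get a 1:1 correspondence. But you can assign the most frequent species in each cluster like this: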

data(iris)

# k-means clustering
set.seed(5834)
km.out <- kmeans(iris[,1:4], centers = 3)

# associate species with clusters
(cmat <- table(Species = iris[,5], cluster = km.out$cluster))
#>             cluster
#> Species       1  2  3
#>   setosa     33 17  0
#>   versicolor  0  4 46
#>   virginica   0  0 50

# find the most-frequent species in each cluster
setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))
#>           1           2           3 
#>    "setosa"    "setosa" "virginica"

# find the most-frequent assigned cluster per species
setNames(colnames(cmat)[apply(cmat, 1, which.max)], rownames(cmat))
#>     setosa versicolor  virginica 
#>        "1"        "3"        "3"

Created on 2021-09-22 by the reprex package (v2.0.1)
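
Building on the majority lookup above, a possible next step is to label every observation with the majority species of its cluster and check how often that label agrees with the true species. This is only a sketch: the names labelled, majority, cluster_species, and agreement are introduced here for illustration, and the result depends on the seed and on which partition kmeans happens to find.

```r
data(iris)

# Same clustering call as in the answer above.
set.seed(5834)
km.out <- kmeans(iris[, 1:4], centers = 3)

cmat <- table(Species = iris$Species, cluster = km.out$cluster)

# Majority species for each cluster, named by cluster number.
majority <- setNames(rownames(cmat)[apply(cmat, 2, which.max)],
                     colnames(cmat))

# Work on a copy so the built-in dataset is left untouched.
labelled <- iris

# Look up each observation's cluster by name, hence as.character().
labelled$cluster_species <- unname(majority[as.character(km.out$cluster)])

# Share of observations whose cluster label matches the true species.
agreement <- mean(labelled$cluster_species == as.character(labelled$Species))
print(agreement)
```

Note that two clusters can share the same majority species, in which case some species never appears as a label at all.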
