R使用Purrr Map函数计算KMeans模型的Silhouette距离

R Using Purrr Map function to calculate Silhouette distances of KMeans model

简单的 KMeans 模型

数据

library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
              y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))

型号

kmeans_k2 <- kmeans(data, 2, 9)

Tidymodel 结果

tidy(kmeans_k2)
# A tibble: 2 x 5
      x     y  size withinss cluster
  <dbl> <dbl> <int>    <dbl> <fct>  
1  9.60 10.8     18    255.  1      
2  5.22  5.20    22     76.6 2      
glance(kmeans_k2)
# A tibble: 1 x 4
  totss tot.withinss betweenss  iter
  <dbl>        <dbl>     <dbl> <int>
1  836.         331.      505.     1
augment(kmeans_k2, data)
# A tibble: 40 x 3
       x     y .cluster
   <dbl> <dbl> <fct>   
 1  6.37  5.21 2       
 2  4.44  4.64 2       
 3  5.36  5.76 2       
 4  5.63  4.27 2       
 5  5.40  3.63 2       
 6  4.89  5.43 2       
 7  6.51  4.19 2       
 8  4.91  6.44 2       
 9  7.02  4.57 2       
10  4.94  5.66 2       
# ... with 30 more rows

剪影计算

sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 
summary(sil_k2)$si.summary[4]
     Mean 
0.5466503 

嵌套的 tibble 使用 Purrr 的映射函数为多个不同的 k 创建结果

kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
      k km_model tidydata         glancedata       augmentdata      
  <int> <list>   <list>           <list>           <list>           
1     1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3     3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
      x     y  size withinss cluster
  <dbl> <dbl> <int>    <dbl> <fct>  
1  5.63  5.10    26     117. 1      
2 11.3  10.7     14     176. 2      

问题是,如何将 Silhouette 分数添加到嵌套的 tibble 中?Silhouette 函数需要每个模型的聚类,但我不确定该怎么做。显然我可以摘出单个实例,例如

data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 

但是当我尝试将它与地图一起使用时它不起作用

kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.

我可以创建一个函数,它再次适用于单次事件

my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 

但是当我将它与地图一起使用时失败了。

kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.

我怀疑问题与我尝试从嵌套模型中检索 $cluster 的方式有关,因为我尝试提取它以创建它自己的列,但也无法使其工作。

把它作为一个答案,因为注释实际上不允许大量代码。

以下对我有用:

kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data),
         silhouettedata = map(augmentdata, ~ silhouette(as.numeric(levels(.x$.cluster))[.x$.cluster], dist(data))))

unnest(kmeans_k123, silhouettedata)

# A tibble: 81 x 6
       k km_model tidydata         glancedata       augmentdata       silhouettedata[,"cluster"] [,"neighbor"] [,"sil_width"]
   <int> <list>   <list>           <list>           <list>                                 <dbl>         <dbl>          <dbl>
 1     1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                         NA            NA         NA    
 2     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.743
 3     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.776
 4     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.772
 5     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.771
 6     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.742
 7     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.794
 8     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.723
 9     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.713
10     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.683

关于as.numeric(levels(.x$.cluster))[.x$.cluster]的用法,这是因为broom::tidy()将簇变量变成了因子,而cluster::silhouette()要求簇变量为数值型。 This answer 提供了您使用该特定代码行将数字因子转换为数字变量的原因。