R使用Purrr Map函数计算KMeans模型的Silhouette距离
R Using Purrr Map function to calculate Silhouette distances of KMeans model
简单的 KMeans 模型
数据
library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))
型号
kmeans_k2 <- kmeans(data, 2, 9)
Tidymodel 结果
tidy(kmeans_k2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 9.60 10.8 18 255. 1
2 5.22 5.20 22 76.6 2
glance(kmeans_k2)
# A tibble: 1 x 4
totss tot.withinss betweenss iter
<dbl> <dbl> <dbl> <int>
1 836. 331. 505. 1
augment(kmeans_k2, data)
# A tibble: 40 x 3
x y .cluster
<dbl> <dbl> <fct>
1 6.37 5.21 2
2 4.44 4.64 2
3 5.36 5.76 2
4 5.63 4.27 2
5 5.40 3.63 2
6 4.89 5.43 2
7 6.51 4.19 2
8 4.91 6.44 2
9 7.02 4.57 2
10 4.94 5.66 2
# ... with 30 more rows
剪影计算
sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
summary(sil_k2)$si.summary[4]
Mean
0.5466503
嵌套的 tibble 使用 Purrr 的映射函数为多个不同的 k 创建结果
kmeans_k123 <- tibble(k = 1:3) %>%
mutate(km_model = map(k, ~kmeans(data, .x)),
tidydata = map(km_model, tidy),
glancedata = map(km_model, glance),
augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
k km_model tidydata glancedata augmentdata
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3 3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 5.63 5.10 26 117. 1
2 11.3 10.7 14 176. 2
问题是,如何将 Silhouette 分数添加到嵌套的 tibble 中?Silhouette 函数需要每个模型的聚类,但我不确定该怎么做。显然我可以摘出单个实例,例如
data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
但是当我尝试将它与地图一起使用时它不起作用
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.
我可以创建一个函数,它再次适用于单次事件
my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
但是当我将它与地图一起使用时失败了。
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.
我怀疑问题与我尝试从嵌套模型中检索 $cluster 的方式有关,因为我尝试提取它以创建它自己的列,但也无法使其工作。
把它作为一个答案,因为注释实际上不允许大量代码。
以下对我有用:
kmeans_k123 <- tibble(k = 1:3) %>%
mutate(km_model = map(k, ~kmeans(data, .x)),
tidydata = map(km_model, tidy),
glancedata = map(km_model, glance),
augmentdata = map(km_model, augment, data),
silhouettedata = map(augmentdata, ~ silhouette(as.numeric(levels(.x$.cluster))[.x$.cluster], dist(data))))
unnest(kmeans_k123, silhouettedata)
# A tibble: 81 x 6
k km_model tidydata glancedata augmentdata silhouettedata[,"cluster"] [,"neighbor"] [,"sil_width"]
<int> <list> <list> <list> <list> <dbl> <dbl> <dbl>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> NA NA NA
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.743
3 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.776
4 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.772
5 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.771
6 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.742
7 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.794
8 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.723
9 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.713
10 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.683
关于as.numeric(levels(.x$.cluster))[.x$.cluster]
的用法,这是因为broom::tidy()
将簇变量变成了因子,而cluster::silhouette()
要求簇变量为数值型。 This answer 提供了您使用该特定代码行将数字因子转换为数字变量的原因。
简单的 KMeans 模型
数据
library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))
型号
kmeans_k2 <- kmeans(data, 2, 9)
Tidymodel 结果
tidy(kmeans_k2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 9.60 10.8 18 255. 1
2 5.22 5.20 22 76.6 2
glance(kmeans_k2)
# A tibble: 1 x 4
totss tot.withinss betweenss iter
<dbl> <dbl> <dbl> <int>
1 836. 331. 505. 1
augment(kmeans_k2, data)
# A tibble: 40 x 3
x y .cluster
<dbl> <dbl> <fct>
1 6.37 5.21 2
2 4.44 4.64 2
3 5.36 5.76 2
4 5.63 4.27 2
5 5.40 3.63 2
6 4.89 5.43 2
7 6.51 4.19 2
8 4.91 6.44 2
9 7.02 4.57 2
10 4.94 5.66 2
# ... with 30 more rows
剪影计算
sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
summary(sil_k2)$si.summary[4]
Mean
0.5466503
嵌套的 tibble 使用 Purrr 的映射函数为多个不同的 k 创建结果
kmeans_k123 <- tibble(k = 1:3) %>%
mutate(km_model = map(k, ~kmeans(data, .x)),
tidydata = map(km_model, tidy),
glancedata = map(km_model, glance),
augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
k km_model tidydata glancedata augmentdata
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3 3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 5.63 5.10 26 117. 1
2 11.3 10.7 14 176. 2
问题是,如何将 Silhouette 分数添加到嵌套的 tibble 中?Silhouette 函数需要每个模型的聚类,但我不确定该怎么做。显然我可以摘出单个实例,例如
data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
但是当我尝试将它与地图一起使用时它不起作用
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.
我可以创建一个函数,它再次适用于单次事件
my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
但是当我将它与地图一起使用时失败了。
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.
我怀疑问题与我尝试从嵌套模型中检索 $cluster 的方式有关,因为我尝试提取它以创建它自己的列,但也无法使其工作。
把它作为一个答案,因为注释实际上不允许大量代码。
以下对我有用:
kmeans_k123 <- tibble(k = 1:3) %>%
mutate(km_model = map(k, ~kmeans(data, .x)),
tidydata = map(km_model, tidy),
glancedata = map(km_model, glance),
augmentdata = map(km_model, augment, data),
silhouettedata = map(augmentdata, ~ silhouette(as.numeric(levels(.x$.cluster))[.x$.cluster], dist(data))))
unnest(kmeans_k123, silhouettedata)
# A tibble: 81 x 6
k km_model tidydata glancedata augmentdata silhouettedata[,"cluster"] [,"neighbor"] [,"sil_width"]
<int> <list> <list> <list> <list> <dbl> <dbl> <dbl>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> NA NA NA
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.743
3 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.776
4 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.772
5 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.771
6 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.742
7 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.794
8 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.723
9 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.713
10 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.683
关于as.numeric(levels(.x$.cluster))[.x$.cluster]
的用法,这是因为broom::tidy()
将簇变量变成了因子,而cluster::silhouette()
要求簇变量为数值型。 This answer 提供了您使用该特定代码行将数字因子转换为数字变量的原因。