R: Same clusters in k-means despite different settings for nstart and iter.max
Why do I get the same clustering even though I use (very) different settings for iter.max and nstart in kmeans()?
set.seed(1)
ff_1 <- kmeans(faithful, 2, iter.max = 1, nstart = 1)
set.seed(1)
ff_2 <- kmeans(faithful, 2, iter.max = 2, nstart = 1)
set.seed(1)
ff_300 <- kmeans(faithful, 2, iter.max = 300, nstart = 300)
identical(ff_1, ff_2) # TRUE
identical(ff_1, ff_300) # TRUE
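My real goal is to visualise how k-means clustering converges (for teaching purposes) by comparing the clustering after one iteration with the clustering after 2, 3, or 10 iterations. That is why I included the set.seed lines.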
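kmeans chooses its initial centroids at random. You get the same result in all these cases because
(1) you set the same random seed (1) every time, which forces the same initial centroids to be chosen, and
(2) the clusters are cleanly separable, so the algorithm converges very quickly, right after the first iteration.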
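Point (1) can be checked without running kmeans at all. The sketch below assumes only that kmeans draws its starting rows from R's global random number generator, which is why set.seed affects it. The names draw_a, draw_b, draw_c and same_seed_match are illustrative and do not come from the original post.

```r
# Resetting the seed replays the same random draw.
n <- nrow(faithful)          # faithful ships with base R (datasets package)

set.seed(1)
draw_a <- sample.int(n, 2)   # two candidate starting rows

set.seed(1)
draw_b <- sample.int(n, 2)   # same seed, so the same two rows again

set.seed(12)
draw_c <- sample.int(n, 2)   # a different seed starts a different random stream

same_seed_match <- identical(draw_a, draw_b)
print(same_seed_match)
print(rbind(draw_a, draw_b, draw_c))
```

With the same seed, every call starts from the same random stream, so any difference between the fits would have to come from iter.max or nstart alone.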
The plots produced by the following code show this:
library(grid)
library(gridExtra)
library(ggplot2)
set.seed(1)
ff_1 <- kmeans(faithful, 2, iter.max = 1, nstart = 1)
set.seed(1)
ff_2 <- kmeans(faithful, 2, iter.max = 2, nstart = 1)
set.seed(1)
ff_300 <- kmeans(faithful, 2, iter.max = 300, nstart = 300)
grid.arrange(
  ggplot(faithful, aes(eruptions, waiting, col=as.factor(ff_1$cluster))) + geom_point() +
    geom_point(data=as.data.frame(ff_1$centers), aes(eruptions, waiting), col='black', pch='*', cex=15) +
    labs(title = "kmeans seed 1\n", color = "ff1 cluster\n"),
  ggplot(faithful, aes(eruptions, waiting, col=as.factor(ff_2$cluster))) + geom_point() +
    geom_point(data=as.data.frame(ff_2$centers), aes(eruptions, waiting), col='black', pch='*', cex=15) +
    labs(title = "kmeans seed 1\n", color = "ff2 cluster\n"),
  ggplot(faithful, aes(eruptions, waiting, col=as.factor(ff_300$cluster))) + geom_point() +
    geom_point(data=as.data.frame(ff_300$centers), aes(eruptions, waiting), col='black', pch='*', cex=15) +
    labs(title = "kmeans seed 1\n", color = "ff300 cluster\n"))
identical(ff_1, ff_2) # TRUE
identical(ff_1, ff_300) # TRUE
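Now let's change the seed, which forces kmeans to choose different initial centroids. The results then differ, as the plots produced by the following code show.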
set.seed(1)
ff_1 <- kmeans(faithful, 2, iter.max = 1, nstart = 1)
set.seed(12)
ff_2 <- kmeans(faithful, 2, iter.max = 2, nstart = 1)
set.seed(123)
ff_300 <- kmeans(faithful, 2, iter.max = 300, nstart = 300)
grid.arrange(
  ggplot(faithful, aes(eruptions, waiting, col=as.factor(ff_1$cluster))) + geom_point() +
    geom_point(data=as.data.frame(ff_1$centers), aes(eruptions, waiting), col='black', pch='*', cex=15) +
    labs(title = "kmeans seed 1\n", color = "ff1 cluster\n"),
  ggplot(faithful, aes(eruptions, waiting, col=as.factor(ff_2$cluster))) + geom_point() +
    geom_point(data=as.data.frame(ff_2$centers), aes(eruptions, waiting), col='black', pch='*', cex=15) +
    labs(title = "kmeans seed 12\n", color = "ff2 cluster\n"),
  ggplot(faithful, aes(eruptions, waiting, col=as.factor(ff_300$cluster))) + geom_point() +
    geom_point(data=as.data.frame(ff_300$centers), aes(eruptions, waiting), col='black', pch='*', cex=15) +
    labs(title = "kmeans seed 123\n", color = "ff300 cluster\n"))
identical(ff_1, ff_2) # FALSE
identical(ff_1, ff_300) # FALSE
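For the stated goal of watching convergence step by step, one possible approach is to remove the randomness altogether. This is only a sketch and was not part of the original answer. It passes explicit starting centroids to kmeans and raises iter.max one step at a time. The starting rows (1 and 3 of faithful) are an arbitrary, hypothetical choice, and the names init, iters, fits and wss are illustrative. algorithm = "Lloyd" is used here on the assumption that its plain assign-then-update passes are the easiest to show in a classroom.

```r
# Same explicit starting centroids for every fit, so only iter.max varies.
x <- as.matrix(faithful)
init <- x[c(1, 3), ]   # hypothetical starting centroids: two rows of the data

iters <- 1:5
fits <- lapply(iters, function(i) {
  # kmeans warns when it reaches iter.max before converging; that is
  # expected here, so the warning is silenced.
  suppressWarnings(
    kmeans(x, centers = init, iter.max = i, algorithm = "Lloyd")
  )
})

# Total within-cluster sum of squares after at most 1, 2, ..., 5 iterations.
wss <- sapply(fits, function(f) f$tot.withinss)
print(data.frame(iter.max = iters, tot.withinss = wss))
```

Each element of fits carries its own cluster and centers components, so the ggplot code above could be reused to draw one panel per iteration count.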