sd of clusters sometimes returns NA, sometimes not

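I have a series of observations that I cluster with kmeans. I then compute the standard deviation (sd) within each cluster and take the maximum.

If I run the same code several times, an NA sometimes appears.

Decreasing k_n (the number of clusters) has the same effect, and increasing it makes things worse; running with or without na.rm=T makes no difference.

Can someone explain what I am doing wrong?

Code: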

k_n <- 11
clusters <- as.data.frame(kmeans(ex, k_n, nstart = 50, iter.max = 15)$cluster)
clusters <- cbind(clusters, ex)
# sd of the values in each cluster (column 1 = cluster label, column 2 = value)
temp <- sapply(1:k_n, function(k) {
  members <- subset(clusters, clusters[, 1] == k)
  sd(members[, 2], na.rm = TRUE)
})
max(temp)
temp

Here are the results of three runs. As you can see, the third run returns NA and the other two do not.

Here is the data "ex":

1400 1400 2000 2000 2000 2000 2001 1400 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 1400 1400 1401 2000 2000 2000 2000 2000 2000 2000 1401 1401 1401 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 2000 2000 2000 1400 1400 2000 2000 2000 1200 1600 1500 1960 1350 1900 1900 1900 1350 2000 2000 2000 2000 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 1900 1350 1900 1900 1350 1350 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1920 1350 1350 1900 1900 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 1900 1350 1900 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1901 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 2000 1350 1900 1900 1900 1350 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1400 1350 1350 1900 
1900 1900 1350 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 0 0 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 1500 1500 1500 1500 1500 1500 1400 1400 1901 1901 1901 1901 1901 1901 1901 1901 1901 1901 1350 1900 1900 1900 1350 1350 2000 2000 2000 2000 1500 1560 1900 1900 1900 1900 1900 1900 1400 2000 1350 1900 1900 1900 1350 2000 2000 2000 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 2000 1900 1900 1900 1900 1900 1900 1900 1900 2000 1350 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1900 1900 1350 1350 1350 1350 1350 1900 2000 1350 1900 1900 1900 1900 1900 1900 1350 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1900 1900 801 1400 1400 1400 1400 1901 1901 1900 1901 1901 1900 1901 1901 1901 1900 1900 1900 1900 1901 1900 1900 1901 1901 1900 1901 1901 1901 1350 1350 1350 1350 1350 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 1401 1401 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1100 1100 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 
2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1800 2000 1800 1800 1400 1200 1400 1600 1800 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000

You get NA for sd because one of the clusters has only one member.
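To make that concrete, here is a minimal sketch of how sd() behaves on a single value. The variable names are only for illustration; the point is that na.rm = TRUE cannot help, because the single value is not missing.

```r
# sd() divides by n - 1, so it is undefined for a vector of length 1
sd_single       <- sd(1960)                 # NA
sd_single_na_rm <- sd(1960, na.rm = TRUE)   # still NA: nothing missing to remove
sd_identical    <- sd(c(2000, 2000))        # 0: two identical values

sd_single
sd_single_na_rm
sd_identical
```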

Using your example:

set.seed(1)
clus = kmeans(ex,k_n, nstart=50,iter.max = 15 )

Using your code:

clusters <- as.data.frame(clus$cluster)
clusters <- cbind(clusters, ex)
temp <- sapply(1:k_n, function(k) {
  members <- subset(clusters, clusters[, 1] == k)
  sd(members[, 2], na.rm = TRUE)
})

> temp
 [1]  0.40191848  4.10391341  0.00000000  0.06097108  1.57499631 12.59800291
 [7]  0.00000000  0.00000000          NA  0.00000000  0.00000000

One of the values is NA, and if you look at which cluster produces it, that cluster has only one member:

clusters[clusters[,1]==which(is.na(temp)),]
   clus$cluster   ex
69            9 1960
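
The same check can be done without subsetting, since the object returned by kmeans() stores the cluster sizes in its size component. Below is a minimal sketch on made-up toy data; toy, km and sd_by_cluster are hypothetical names, not part of the original code.

```r
# Toy data (made up for illustration): two tight groups plus one isolated value
toy <- c(rep(2000, 10), rep(1200, 8), 1960)

set.seed(1)
km <- kmeans(toy, centers = 3, nstart = 10)

# Number of observations in each cluster; a 1 marks a singleton
km$size

# Per-cluster standard deviation, grouped by cluster label
sd_by_cluster <- tapply(toy, km$cluster, sd)

# The clusters with an NA sd are the clusters of size 1
which(is.na(sd_by_cluster))
which(km$size == 1)
```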

If we look at your data:

table(ex)
ex
   0  801 1100 1200 1350 1400 1401 1500 1560 1600 1800 1900 1901 1920 1960 2000 
   2    2    2  123   36   21    5   22    1   94    4  147   18    1    1  268 
2001 
   1

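I think that if you increase k, you are likely to end up with a cluster that has only one member, since that is what lets the algorithm converge.

One approach I would suggest is to increase the number of starts: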

STARTS <- seq(50, 500, by = 50)
# for each nstart value, run kmeans 50 times and count the singleton clusters
n_equal_one <- sapply(STARTS, function(S) {
  replicate(50, sum(kmeans(ex, k_n, nstart = S, iter.max = 15)$size == 1))
})

plot(STARTS, colMeans(n_equal_one), ylab = "Average number of singleton clusters")

So if you try nstart = 400 or 500, you should avoid singletons (clusters with n = 1), but if your data become sparser they may be unavoidable.
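
If singletons cannot be avoided, one option is to drop the NA when taking the maximum rather than inside sd(). This is a sketch with made-up values, and temp_example is a hypothetical stand-in for the vector of per-cluster sds. Whether a singleton should be ignored or counted as zero spread is an assumption that depends on what the maximum is used for.

```r
# Hypothetical per-cluster sds, with one NA coming from a singleton cluster
temp_example <- c(0.40, 4.10, 0, NA, 12.60)

max_with_na    <- max(temp_example)                 # NA: the NA propagates
max_without_na <- max(temp_example, na.rm = TRUE)   # NA dropped before the max

# Alternative: treat the spread of a singleton cluster as 0
temp_zero <- ifelse(is.na(temp_example), 0, temp_example)
max_zero  <- max(temp_zero)

max_with_na
max_without_na
max_zero
```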

dput(ex)

c(1400, 1400, 2000, 2000, 2000, 2001, 1400, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1400, 1400, 1401, 
2000, 2000, 2000, 2000, 1401, 1401, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 801, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 1400, 1400, 2000, 2000, 2000, 1400, 1400, 2000, 2000, 2000, 
1200, 1600, 1500, 1960, 1350, 1900, 1900, 1900, 1350, 2000, 2000, 
2000, 2000, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
1200, 2000, 2000, 2000, 2000, 2000, 2000, 1900, 1350, 1900, 1900, 
1350, 1350, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1200, 1200, 
1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1920, 1350, 1350, 
1900, 1900, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 
1200, 1200, 1200, 1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
1900, 1350, 1900, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
1900, 1901, 2000, 2000, 2000, 2000, 1200, 1200, 1200, 1200, 1200, 
1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1900, 
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
1900, 1500, 1500, 1500, 1500, 2000, 1350, 1900, 1900, 1900, 1350, 
2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1500, 1500, 1400, 
1350, 1350, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 
2000, 1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 
1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 2000, 2000, 
2000, 2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
1900, 1900, 1900, 1900, 1900, 1500, 1500, 1500, 1500, 0, 0, 1900, 
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
1900, 1900, 1500, 1500, 1500, 2000, 2000, 2000, 2000, 2000, 1200, 
1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
2000, 2000, 2000, 2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 
1500, 1500, 1500, 1400, 1901, 1901, 1901, 1901, 2000, 1350, 1900, 
1900, 1900, 1350, 1350, 2000, 2000, 2000, 2000, 1500, 1560, 1900, 
1900, 1900, 1900, 1900, 1900, 1400, 2000, 1350, 1900, 1900, 1900, 
1350, 2000, 2000, 2000, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1400, 1900, 
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
1900, 1900, 1900, 1900, 1900, 1900, 1500, 1500, 1500, 1500, 2000, 
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 2000, 1350, 1900, 
1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1600, 
1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 
1600, 1900, 1900, 1350, 1350, 1350, 1350, 1350, 1900, 2000, 1350, 
1900, 1900, 1900, 1900, 1900, 1900, 1350, 1350, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 1200, 1200, 1200, 1200, 
1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
1600, 1600, 1600, 1600, 1600, 1900, 1900, 801, 1400, 1400, 1400, 
1400, 1901, 1901, 1900, 1901, 1901, 1900, 1901, 1901, 1901, 1900, 
1900, 1900, 1900, 1901, 1900, 1900, 1901, 1901, 1900, 1901, 1901, 
1901, 1350, 1350, 1350, 1350, 1350, 1350, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 1400, 1400, 1401, 1401, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1100, 1100, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 1800, 2000, 1800, 1800, 1400, 1200, 1400, 
1600, 1800, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
)