k-means and the elbow method produce different plots for the same data and the same centers

Question

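I am trying to find the optimal number of clusters from an elbow plot. The problem is that the code produces a different plot every time I run it.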

    par(mfrow = c(2, 2))
    # Draw the elbow curve four times on the same data
    for (i in 1:4) {
      data <- read.csv(file = "C:/Users/sd0298/Desktop/data.csv", header = TRUE)
      # Within-groups sum of squares for a single cluster
      wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
      # Inner loop uses k so it does not overwrite the outer index i
      for (k in 2:15) {
        wss[k] <- sum(kmeans(data, centers = k, iter.max = 500, nstart = 1, algorithm = "Lloyd", trace = TRUE)$withinss)
      }
      plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares", main = "SSE vs Cluster levels", cex.axis = 0.8)
    }

Likewise, whenever I plot the clusters for the same number of centers on the same data, I get a different plot each time.

    library(cluster)  # provides clusplot()
    par(mfrow = c(2, 2))
    for (i in 1:4) {
      data <- read.csv(file = "C:/Users/sd0298/Desktop/data.csv", header = TRUE)
      wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
      km <- kmeans(data, centers = 4, iter.max = 500, nstart = 1, algorithm = "Lloyd", trace = TRUE)
      clusplot(data, km$cluster, color = TRUE, shade = TRUE, span = TRUE, col.p = c("#666666"), lines = 0, plotchar = FALSE, sub = "", main = "", labels = 5)
    }

Can anyone tell me what is going wrong, and how to reproduce the same clusters when neither the centers nor the data have changed?

Answer 1

You are making the following call to R's kmeans() function:

    km <- kmeans(data, centers=4, iter.max = 500, nstart = 1, algorithm  = "Lloyd" , trace =  TRUE)

The Wikipedia page for Lloyd's k-means algorithm states the following:

Lloyd's algorithm starts by an initial placement of some number k of point sites in the input domain. In mesh smoothing applications, these would be the vertices of the mesh to be smoothed; in other applications they may be placed at random, or by intersecting a uniform triangular mesh of the appropriate size with the input domain.

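R's kmeans() uses random initial conditions: when centers is given as a number, that many rows of the data are picked at random as the starting centers. The nstart argument controls how many random initializations are tried. In other words, if you run Lloyd's algorithm several times, the clusterings you get may not be exactly the same.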
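To address the reproducibility part of the question, here is a minimal sketch that fixes the random number generator's seed before each call. The data frame `demo` is synthetic and only stands in for data.csv, which is not available here, and the seed values 1 and 42 are arbitrary choices.

```r
# Synthetic stand-in for data.csv (an assumption, not the asker's data)
set.seed(1)
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)

# Setting the same seed right before each call makes kmeans() draw
# the same random starting centers, so the two fits match
set.seed(42)
km1 <- kmeans(demo, centers = 4, iter.max = 500, nstart = 1, algorithm = "Lloyd")
set.seed(42)
km2 <- kmeans(demo, centers = 4, iter.max = 500, nstart = 1, algorithm = "Lloyd")

same_clusters <- identical(km1$cluster, km2$cluster)
print(same_clusters)
```

In the question's loops, the equivalent change would be a set.seed() call placed immediately before each kmeans() call.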

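However, you can treat this nondeterministic behavior as an opportunity to check how reliable your clusters are. If you run Lloyd's algorithm several times and keep getting similar clusters, that suggests the clusters are meaningful. If repeated k-means runs give very different results, the clusters are not reliable.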
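One way to run that check, sketched under assumptions: `demo` is again synthetic stand-in data, and 10 repeats, nstart = 25, and the range 2:6 are arbitrary illustrative choices. The first part compares the total within-cluster sum of squares across independent single-start runs. The second part builds the elbow values with several random starts per k, so that kmeans() keeps the best of them.

```r
# Synthetic stand-in for data.csv: three blobs in two dimensions (an assumption)
set.seed(1)
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 5))
)

# Ten independent single-start runs; a wide spread in tot.withinss
# means the result depends heavily on the random starting centers
single_runs <- sapply(1:10, function(run) {
  kmeans(demo, centers = 3, iter.max = 500, nstart = 1, algorithm = "Lloyd")$tot.withinss
})
print(range(single_runs))

# Elbow values: k = 1 computed as in the question, k = 2..6 with 25
# random starts each, of which kmeans() keeps the best fit
wss <- (nrow(demo) - 1) * sum(apply(demo, 2, var))
for (k in 2:6) {
  wss[k] <- kmeans(demo, centers = k, iter.max = 500, nstart = 25, algorithm = "Lloyd")$tot.withinss
}
print(wss)
```

The resulting `wss` vector can be passed to the same plot() call used in the question. With more random starts per k, the curve usually changes much less from run to run, although only a fixed seed makes it exactly repeatable.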
