Clustering sea waves data in R
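I have clustered storm energy data in R using several clustering methods (kmeans, hclust, agnes, fanny). Even though it is easy to choose the method that works best for my data, I need a computational (not theoretical) way to compare and assess the methods by their results. Do you know of anything like that?
Thanks in advance,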
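Thanks for the question. I learned that you can compute the optimal number of clusters with the eclust function from the factoextra package.
Demo using kmeans, from here: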
# Load and scale the dataset
data("USArrests")
DF <- scale(USArrests)
When the data are not scaled, the clustering results might not be reliable ([example](http://stats.stackexchange.com/questions/140711/why-does-gap-statistic-for-k-means-suggest-one-cluster-even-though-there-are-ob)).
library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(DF, "kmeans")
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
Clustering function comparison:
You can run all of the available methods and compute the optimal number of clusters for each:
clusterFuncList <- c("kmeans", "pam", "clara", "fanny", "hclust", "agnes", "diana")
# lapply() plus rbind() yields one data frame row per clustering function
# (sapply() would simplify the results into a matrix instead)
resultList <- do.call(rbind, lapply(clusterFuncList, function(x) {
  cat("Begin clustering for function:", x, "\n")
  # For each clustering function find the optimal number of clusters; graph = FALSE disables plotting
  clustObj <- eclust(DF, x, graph = FALSE)
  cat("End clustering for function:", x, "\n\n\n")
  # Return the optimal number of clusters for this clustering function
  data.frame(clustFunc = x, optimalNumbClusters = clustObj$nbclust, stringsAsFactors = FALSE)
}))
# >resultList
# clustFunc optimalNumbClusters
# 1 kmeans 4
# 2 pam 4
# 3 clara 5
# 4 fanny 5
# 5 hclust 4
# 6 agnes 4
# 7 diana 4
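Several methods above settle on the same number of clusters, which does not by itself mean they produce the same partition. A minimal sketch of one way to check this, assuming k = 4 and comparing only kmeans and pam: cross-tabulate the two label vectors with base R's table(). The variable names (km_labels, pam_labels, agreement) are illustrative and not part of the original answer.

```r
library(cluster)

# Same preprocessing as in the demo: standardize the built-in USArrests data
DF <- scale(USArrests)

set.seed(42)
# Cluster labels from two methods at the same k (k = 4 is an assumption here)
km_labels  <- kmeans(DF, centers = 4, nstart = 25)$cluster
pam_labels <- pam(DF, k = 4)$clustering

# Rows: kmeans clusters, columns: pam clusters.
# Cluster numbering is arbitrary, so look for one dominant cell per row
# rather than for counts on the diagonal.
agreement <- table(kmeans = km_labels, pam = pam_labels)
print(agreement)
```

If each row has a single large cell, the two methods are grouping the states in nearly the same way; counts spread across a row point to a genuine disagreement.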
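Gap statistic as a goodness-of-fit measure:
The "gap statistic" is used as a goodness-of-fit measure for clustering algorithms; see the paper.
For a fixed, user-defined number of clusters, we can compare the gap statistic of each clustering algorithm using the clusGap function from the cluster package: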
numbClusters = 5
library(cluster)
clusterFuncFixedK = c("kmeans", "pam", "clara", "fanny")
gapStatList <- do.call(rbind,lapply(clusterFuncFixedK,function(x) {
cat("Begin clustering for function:",x,"\n")
set.seed(42)
#For each clustering function compute gap statistic
gapStatBoot=clusGap(DF,FUNcluster=get(x),K.max=numbClusters)
gapStatVec= round(gapStatBoot$Tab[,"gap"],3)
gapStat_at_AllClusters = paste(gapStatVec,collapse=",")
gapStat_at_chosenCluster = gapStatVec[numbClusters]
#return gap statistic for each clustering function
cat("End clustering for function:",x,"\n\n\n")
resultDF = data.frame(clustFunc = x, gapStat_at_AllClusters = gapStat_at_AllClusters,gapStat_at_chosenCluster = gapStat_at_chosenCluster, stringsAsFactors=FALSE)
}))
# >gapStatList
# clustFunc gapStat_at_AllClusters gapStat_at_chosenCluster
#1 kmeans 0.184,0.235,0.264,0.233,0.27 0.270
#2 pam 0.181,0.253,0.274,0.307,0.303 0.303
#3 clara 0.181,0.253,0.276,0.311,0.315 0.315
#4 fanny 0.181,0.23,0.313,0.351,0.478 0.478
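The table above lists the gap statistic of each algorithm for every number of clusters from k = 1 to 5. The third column, gapStat_at_chosenCluster, holds the gap statistic at k = 5 clusters. A larger gap statistic means the within-cluster dispersion falls further below what the reference (no-cluster) distribution produces, so at k = 5 clusters fanny (0.478) performs best relative to the other algorithms on the USArrests dataset, while kmeans (0.270) scores lowest.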
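Instead of fixing k in advance, the number of clusters can also be read off the clusGap output itself. The following is a minimal sketch, assuming kmeans as the clustering function and the "Tibs2001SEmax" rule (the smallest k whose gap is within one standard error of the gap at k + 1); B = 50 and nstart = 20 are illustrative settings, not values from the original answer.

```r
library(cluster)

# Same preprocessing as in the demo: standardize the built-in USArrests data
DF <- scale(USArrests)

set.seed(42)
# B = 50 reference data sets keeps the run short; raise it for more stable estimates.
# nstart = 20 is passed through to kmeans().
gap_km <- clusGap(DF, FUNcluster = kmeans, K.max = 5, B = 50, nstart = 20)

# maxSE() applies a selection rule to the gap values and their standard errors
k_hat <- maxSE(f = gap_km$Tab[, "gap"],
               SE.f = gap_km$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
print(k_hat)
```

Other rules such as "firstSEmax" or "globalmax" can be swapped in through the method argument, and they may select a different k on the same gap values.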