Problems with cluster assignment after clustering
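I am having trouble understanding cluster assignment in k-means clustering. Specifically, I know that each point is assigned to the nearest cluster (the shortest distance to a cluster center), but I cannot reproduce the result. Details follow.
Suppose I have a data frame df1: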
set.seed(16)
df1 = data.frame(matrix(sample(1:50, replace = T), ncol=10, nrow=10000))
head(df1, n=4)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 35 35 35 35 35 35 35 35 35 35
2 13 13 13 13 13 13 13 13 13 13
3 23 23 23 23 23 23 23 23 23 23
4 12 12 12 12 12 12 12 12 12 12
I want to run k-means clustering on this data frame (with scaling):
for_clst_km = scale(df1, center=F) #standardization with z-scores
kclust = 6 #number of clusters
Clusters <- kmeans(for_clst_km, kclust)
Once clustering is done, I can attach the cluster assignments to the original data frame:
df1$cluster = Clusters$cluster
For testing purposes, let's pick cluster 3.
library(dplyr)
cluster3 = df1 %>% filter(cluster == 3)
Because I want to scale cluster3 first, I need to drop the cluster column and then apply z-standardization:
cluster3$cluster = NULL
cluster3_1 = (cluster3-colMeans(df1))/apply(df1,2,sd)
Now that the values in cluster3_1 are scaled, I can compute the distance from each point to every cluster center:
centroids = data.matrix(Clusters$centers)
dist_to_clust1 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[1,])^2)))
dist_to_clust2 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[2,])^2)))
dist_to_clust3 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[3,])^2)))
dist_to_clust4 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[4,])^2)))
dist_to_clust5 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[5,])^2)))
dist_to_clust6 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[6,])^2)))
dist_to_clust = cbind(dist_to_clust1, dist_to_clust2, dist_to_clust3, dist_to_clust4, dist_to_clust5, dist_to_clust6)
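Finally, looking at the distances to each cluster, it is clear that I am doing something wrong. For example, in the fifth row the point is closest to cluster 4 (that column holds the minimum value), even though every row here was assigned to cluster 3.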
head(dist_to_clust)
dist_to_clust1 dist_to_clust2 dist_to_clust3 dist_to_clust4 dist_to_clust5 dist_to_clust6
[1,] 11.015929 11.116591 10.946547 11.173597 11.034535 10.968986
[2,] 13.136060 12.848511 12.967084 13.379930 12.840414 12.861085
[3,] 13.681588 13.314994 13.492713 13.942535 13.322293 13.360695
[4,] 10.506083 10.725233 10.467843 10.636465 10.621233 10.529714
[5,] 2.157906 5.392285 3.120574 1.168265 4.855553 4.197457
[6,] 11.015929 11.116591 10.946547 11.173597 11.034535 10.968986
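I think my scaling approach is wrong. I am not sure whether I can really scale the cluster 3 points with the mean and standard deviation of the whole data frame.
Could you please share your thoughts on what I am doing wrong?
Thank you very much!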
Your hand-written scaling code is broken.
Check the standard deviation of the resulting data: it is not 1.
Why don't you simply use
cluster3 = for_clst_km[Clusters$cluster == 3, ] # for_clst_km is a matrix without a cluster column, so index its rows directly
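The standard-deviation check suggested above can be sketched on toy data. The matrix m and the name hand are made up for illustration, and the hand-written formula only mirrors the pattern from the question rather than reproducing it exactly:

```r
set.seed(1)
# Toy data: 10 rows, 4 columns with clearly different column means.
m <- matrix(rnorm(40, mean = rep(c(0, 10, 20, 30), each = 10)), ncol = 4)

# Hand-written scaling in the style of the question.
hand <- (m - colMeans(m)) / apply(m, 2, sd)

# Column standard deviations after hand-written scaling.
print(apply(hand, 2, sd))

# Column standard deviations after scale(), for comparison.
print(apply(scale(m), 2, sd))
```

If the hand-written version were a correct z-score, both printed vectors would consist of ones.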
As per my answer on Cross Validated:
The reason is that df - colMeans(df) does not do what you think it does.
Let's try some code:
a=matrix(1:9,nrow=3)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
colMeans(a)
[1] 2 5 8
a-colMeans(a)
[,1] [,2] [,3]
[1,] -1 2 5
[2,] -3 0 3
[3,] -5 -2 1
apply(a,2,function(x) x-mean(x))
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
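You will see that a-colMeans(a) does something different from apply(a,2,function(x) x-mean(x)), and the latter is what you want for centering.
R stores a matrix column by column and recycles the shorter vector along it, so the three column means are subtracted in turn from the three rows of each column instead of one mean per column.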
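A minimal sketch of column-wise centering without apply, using base R's sweep() (my choice of function, not part of the original answer), next to the recycled subtraction for comparison:

```r
a <- matrix(1:9, nrow = 3)

# Subtract each column's mean from that column (MARGIN = 2 means columns).
centered <- sweep(a, 2, colMeans(a), FUN = "-")

# Plain subtraction recycles colMeans(a) down each column instead.
recycled <- a - colMeans(a)

print(centered)
print(recycled)
```

Here centered should match the apply() result shown above, while recycled reproduces the a-colMeans(a) output.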
You could write an apply call that does the full autoscaling for you:
apply(a,2,function(x) (x-mean(x))/sd(x))
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
scale(a)
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1
But there is no point in doing that, because scale will do it for you. :)
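If the aim is to scale a subset of rows with the mean and standard deviation of the full data set, one option is to reuse the attributes that scale attaches to its result. This is only a sketch on toy data; the names full, sub, and sub_sc are invented for illustration:

```r
set.seed(1)
full <- matrix(rnorm(50), ncol = 5)    # toy data, 10 rows x 5 columns
full_sc <- scale(full)                 # column-wise z-scores

ctr <- attr(full_sc, "scaled:center")  # column means of the full data
scl <- attr(full_sc, "scaled:scale")   # column standard deviations of the full data

# Scale only the first three rows, but with the full-data parameters.
sub <- full[1:3, , drop = FALSE]
sub_sc <- scale(sub, center = ctr, scale = scl)

# Compare with the corresponding rows of the fully scaled matrix.
print(all.equal(as.vector(sub_sc), as.vector(full_sc[1:3, ])))
```

Scaling the subset this way gives the same values as taking the rows straight from the already scaled matrix, which is usually the simpler route.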
Also, to try out the clustering:
set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc,replace = T), ncol=nc, nrow=nr)
head(df1, n=4)
for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)
# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]
# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))
centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,]) # Calculate observation distances to centroid d=1..nclust
whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabularize
> table(whichMins)
whichMins
3
2532
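The same check can be extended to all clusters at once. The sketch below uses small made-up data with three well-separated groups, and it assumes kmeans has converged; the names x, km, d2, and nearest are illustrative only:

```r
set.seed(42)
# Toy data: three well-separated groups in two dimensions.
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2),
           matrix(rnorm(200, mean = 12), ncol = 2))
x_sc <- scale(x)
km <- kmeans(x_sc, centers = 3, nstart = 5, iter.max = 100)

# Squared Euclidean distance from every row to every centroid (n x k matrix).
d2 <- sapply(seq_len(nrow(km$centers)), function(k) {
  rowSums(sweep(x_sc, 2, km$centers[k, ])^2)
})

# Index of the closest centroid for each observation.
nearest <- apply(d2, 1, which.min)

# Cross-tabulate the kmeans assignment against the recomputed nearest centroid.
print(table(assigned = km$cluster, nearest = nearest))
```

When the distances are computed on the same scaled matrix that was passed to kmeans, I would expect all counts to fall on the diagonal of the table.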
HTH,
Carl