Kmeans: Wrong size of clusters
I am running the Kmeans algorithm in R on the Heart Disease UCI dataset. I should get 2 clusters of sizes 138 and 165, as in the dataset.
Steps:
- Store the dataset in a data frame:
df <- read.csv(".../heart.csv",fileEncoding = "UTF-8-BOM")
- Extract the features:
features = subset(df, select = -target)
- Normalize (min-max scaling):
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
features = data.frame(sapply(features, normalize))
- Run the algorithm:
set.seed(0)
cluster = kmeans(features, 2)
cluster$size
Output:
[1] 99 204
Why?
Here is an example that may help clarify things.
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
df <- USArrests
df <- na.omit(df)
df <- scale(df)
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
fviz_cluster(k2, data = df)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
set.seed(123)
# function to compute total within-cluster sum of squares
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
# extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(df, 4, nstart = 25)
print(final)
fviz_cluster(final, data = df)
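You seem to be focusing on the size of the clusters rather than the accuracy of the predictions. You may well get two clusters of sizes (138, 165), but they will not necessarily match the 'target' column in the data.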
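To illustrate why matching sizes say little on their own, here is a minimal sketch. It uses made-up labels rather than heart.csv: `truth` only copies the 138/165 class sizes, and `label` is a random reshuffle of it, standing in for a clustering with the "right" sizes but unrelated membership.

```r
set.seed(7)

# Made-up ground truth with the same class sizes as the question (138 and 165)
truth <- rep(c(0, 1), times = c(138, 165))

# A labelling with identical group sizes but shuffled membership
label <- sample(truth)

table(label)           # group sizes are still 138 and 165
mean(label == truth)   # share of rows where the two labellings agree
```

The sizes match exactly, yet the agreement with `truth` is only what a random reshuffle produces, which is why the comparison should be made row by row against `target`.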
A better way to judge performance is the accuracy of the predictions. In your case, your model is 72% accurate. You can check this with:
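library(caret) # provides confusionMatrix()
# cluster IDs are 1/2; shift to 0/1 (the numbering is arbitrary, so the labels may come out flipped)
df$label <- cluster$cluster - 1
confusionMatrix(table(df$target, df$label))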
#Confusion Matrix and Statistics
#
# 0 1
# 0 76 62
# 1 23 142
#
# Accuracy : 0.7195
# ...
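One caveat with the check above: k-means numbers its clusters arbitrarily, so `cluster$cluster - 1` can come out flipped relative to `target`. A minimal sketch of guarding against that follows. It runs on simulated two-group data rather than heart.csv, and the helper name `aligned_accuracy` is made up for illustration.

```r
# Simulated stand-in for a labelled dataset (not the heart data)
set.seed(1)
n <- 100
truth <- rep(c(0, 1), each = n)
x <- cbind(rnorm(2 * n, mean = truth * 4),
           rnorm(2 * n, mean = truth * 4))

km <- kmeans(x, centers = 2, nstart = 25)
label <- km$cluster - 1   # 0/1, but which cluster becomes 0 is arbitrary

# Hypothetical helper: accuracy under the better of the two possible
# cluster-to-class mappings (only valid for exactly two clusters)
aligned_accuracy <- function(truth, label) {
  acc <- mean(truth == label)
  max(acc, 1 - acc)
}

aligned_accuracy(truth, label)
```

With more than two clusters the same idea needs a proper matching step (for example, assigning each cluster its majority class) instead of the simple `max(acc, 1 - acc)`.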
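I was able to get better accuracy by standardizing the data (z-scores) instead of min-max normalizing it, possibly because standardization is more robust to outliers.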
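I also dummy-coded the variables that look categorical, which seemed to improve accuracy further. We now have about 84.5% accuracy, and the cluster sizes are closer to what we expected (143, 160). As noted above, though, the cluster sizes by themselves are not meaningful.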
library(dplyr)
library(fastDummies)
library(caret)
standardize <- function(x) {
  num <- x - mean(x, na.rm = TRUE)
  denom <- sd(x, na.rm = TRUE)
  num / denom
}
# dummy-code and standardize
# (if df still carries the 'label' column from the earlier check, drop it here as well)
features <- select(df, -target) %>%
  dummy_cols(select_columns = c('cp', 'thal', 'ca'),
             remove_selected_columns = TRUE, remove_first_dummy = TRUE) %>%
  mutate_all(standardize)
set.seed(0)
cluster <- kmeans(features, centers = 2, nstart = 50)
cluster$size
# 143 160
# check predictions vs actual labels
df$label <- cluster$cluster -1
confusionMatrix(table(df$target, df$label))
#Confusion Matrix and Statistics
#
#
# 0 1
# 0 117 21
# 1 26 139
#
# Accuracy : 0.8449
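Of course, there are other accuracy measures worth considering, such as out-of-sample accuracy (split the data into training and test sets, then compute the accuracy of the predictions on the test set) and the F1 score.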
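As a rough sketch of those two measures, the example below fits k-means on a training split, assigns each held-out point to its nearest training centroid, and computes accuracy and F1 on the held-out points. Everything in it is an assumption for illustration: the data are simulated rather than taken from heart.csv, the 70/30 split is arbitrary, and mapping each cluster to the majority class of its training points is just one possible way to turn clusters into predictions.

```r
set.seed(42)
n <- 200
truth <- rep(c(0, 1), each = n / 2)
x <- cbind(rnorm(n, mean = truth * 4),
           rnorm(n, mean = truth * 4))

# 70/30 train/test split
train_idx <- sample(n, size = round(0.7 * n))
x_train <- x[train_idx, ];  y_train <- truth[train_idx]
x_test  <- x[-train_idx, ]; y_test  <- truth[-train_idx]

km <- kmeans(x_train, centers = 2, nstart = 25)

# Map each cluster to the majority class among its training points
cluster_class <- sapply(1:2, function(k) {
  as.numeric(mean(y_train[km$cluster == k]) > 0.5)
})

# Assign each test point to its nearest training centroid
nearest <- apply(x_test, 1, function(p) {
  which.min(colSums((t(km$centers) - p)^2))
})
pred <- cluster_class[nearest]

# Out-of-sample accuracy and F1 score (class 1 treated as positive)
accuracy  <- mean(pred == y_test)
tp        <- sum(pred == 1 & y_test == 1)
precision <- tp / sum(pred == 1)
recall    <- tp / sum(y_test == 1)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, f1 = f1)
```

On real data, the same preprocessing (dummy coding, and scaling with the training set's means and standard deviations) would have to be applied to the test rows before the nearest-centroid step.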