How to convert the cluster id in prediction result to class label in k-means clustering prediction model in R?

Question

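I am playing with this dataset (credit card fraud) and trying to train a prediction model using k-means clustering. However, the pre_model result is labeled with cluster ids. So when I try to evaluate the model's performance with confusionMatrix, it raises an error saying that the testing data and the predicted result do not have the same levels. How can I convert the labels in the prediction result to 1 or 0, as in testdata$Class (fraud or not), instead of cluster ids? Thanks!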

Code:

data = read.csv("creditcard.csv")
data$Class <- as.factor(data$Class)
data_split <- createDataPartition(data$Class, times=1, p=0.8, list=F)
train_data <- data[data_split ,]
testdata <- data[-data_split ,]
scaled_data <- scale(train_data[-c(31)])
scaled_data <- as.matrix(scaled_data)
clust <- kmeans(scaled_data, centers = 9, nstart = 25)

pre_model <- cl_predict(clust , testdata)
confusionMatrix(testdata$Class,  pre_model , positive='1')

K-means clustering result:

> Clustering vector:
   [1] 7 7 6 6 7 7 8 8 3 8 6 3 6 7 7 8 7 8 7 6 6 7 7 7 3 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 3 7 7 7 8 7
  [47] 8 7 6 7 7 1 7 7 8 7 7 3 7 7 8 6 7 3 7 7 8 7 7 7 7 3 7 7 7 7 7 7 7 3 7 7 9 6 7 7 7 7 3 1 7 7
  [93] 7 7 3 7 7 7 7 7 7 7 7 7 8 3 7 7 7 8 7 7 7 7 7 7 7 7 7 7 8 7 6 8 8 7 8 8 8 7 8 7 7 7 7 7 6 8
 [139] 7 8 8 6 7 3 8 3 9 7 7 7 8 7 7 8 7 3 7 7 7 7 6 7 7 7 1 7 7 7 7 7 7 6 7 7 8 7 7 7 8 6 7 8 7 8....

confusionMatrix error:

Error: data and reference should be factors with the same levels.

Answer 1

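Your analysis is complicated by the fact that cluster analysis is an unsupervised method. That is, it does not try to predict the original class assignments (the known labels). It just groups the data based on the independent variables. Let's trace your code using the iris data set, which has only 3 groups instead of the 9 groups your data appears to have.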

library(caret)
library(clue)
set.seed(42)
data(iris)
Class <- iris$Species
iris.z <- scale(iris[, -5])
iris_split <- createDataPartition(iris$Species, times=1, p=0.8, list=FALSE)
iris_train <- iris.z[iris_split, ]
iris_test <- iris.z[-iris_split, ]
Class_train <- Class[iris_split]
Class_test <- Class[-iris_split]

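This follows your code closely, with some minor adjustments. We load the necessary packages, caret and clue, which you did not include in your code. We set the seed for the random number generator because kmeans uses random initial assignments, so the results can differ from one run to the next. Next, we scale all of the data so that the training and test sets are on the same scale, and then we create training and test subsets of both the scaled data and the group membership. Now the cluster analysis: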

Clust_train <- kmeans(iris_train, centers=3, nstart=25)
table(Clust_train$cluster, Class_train)
#    Class_train
#     setosa versicolor virginica
#   1      0          9        31
#   2      0         31         9
#   3     40          0         0

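Note that there is no guarantee that the clusters will match the original group names. With 3 clusters/groups it is straightforward to see that cluster 1 consists mainly of virginica, cluster 2 of versicolor, and cluster 3 of setosa. However, cluster 1 also includes 9 versicolor specimens, and cluster 2 also includes 9 virginica specimens. With 9 clusters, the results may not be so simple.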
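Reading the dominant species off the table by eye works for 3 clusters, but the same lookup can be done programmatically. The following is a minimal sketch in base R, assuming the contingency table printed above is re-entered by hand as a matrix; the names tab, dominant, and purity are illustrative and not part of the original answer.

```r
# Contingency table copied from the output above
# (rows = cluster id, columns = species)
tab <- matrix(c( 0,  9, 31,
                 0, 31,  9,
                40,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = as.character(1:3),
                              species = c("setosa", "versicolor", "virginica")))

# Dominant species per cluster: the column holding the largest count in each row
dominant <- colnames(tab)[apply(tab, 1, which.max)]

# Share of each cluster that belongs to its dominant species
purity <- apply(tab, 1, max) / rowSums(tab)

data.frame(cluster = rownames(tab), dominant = dominant, purity = purity)
```

Clusters 1 and 2 each have 31 of their 40 members in the dominant species, while cluster 3 is entirely setosa. A low share for some cluster is a hint that its label is unreliable.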

Next, convert the cluster numbers to the most likely species name:

train_pre <- factor(ifelse(Clust_train$cluster==1, 3, ifelse(Clust_train$cluster==2, 2, 1)), labels=levels(iris$Species))
tbl_train <- table(train_pre, Class_train)
tbl_train
#             Class_train
# train_pre    setosa versicolor virginica
#   setosa         40          0         0
#   versicolor      0         31         9
#   virginica       0          9        31
sum(diag(tbl_train))/sum(tbl_train) * 100
# [1] 85

So the training cluster analysis groups specimens of the same species together with about 85% accuracy. Now assign the test set to clusters:

pre_model <- cl_predict(Clust_train, iris_test)
test_pre <- factor(ifelse(pre_model==1, 3, ifelse(pre_model==2, 2, 1)), labels=levels(iris$Species))
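# confusionMatrix(data, reference) treats its first argument as the prediction.
# The true classes are passed first here, so in the output below the rows
# (labelled "Prediction") are the true species and the columns are the
# cluster-based labels.
confusionMatrix(Class_test,  test_pre, positive='1')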
# Confusion Matrix and Statistics
# 
#             Reference
# Prediction   setosa versicolor virginica
#   setosa         10          0         0
#   versicolor      0          8         2
#   virginica       0          5         5
#           .  .  .
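With 9 clusters, nested ifelse calls become unwieldy. One possible generalization is a majority vote: label each cluster with the class that is most frequent among its training rows, then give each test row the label of its cluster. The sketch below is an illustration under several assumptions, not the original answer's method: it uses base R only, a plain random split instead of createDataPartition, scaling parameters taken from the training rows alone, and a hypothetical helper nearest_cluster in place of cl_predict. The same steps should carry over to a two-class column such as train_data$Class.

```r
set.seed(42)                               # k-means starts from random centers
y <- iris$Species
train_idx <- sample(nrow(iris), 120)       # assumption: simple random 80/20 split

# Scale the training rows, then reuse their means and sds for the test rows
X_train <- scale(iris[train_idx, -5])
X_test  <- scale(iris[-train_idx, -5],
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))
y_train <- y[train_idx]
y_test  <- y[-train_idx]

km <- kmeans(X_train, centers = 9, nstart = 25, iter.max = 50)

# Majority vote: for each cluster, the class most frequent among its training rows
votes <- table(km$cluster, y_train)
cluster_to_class <- colnames(votes)[apply(votes, 1, which.max)]

# Hypothetical helper: index of the nearest centroid for each row of newdata
nearest_cluster <- function(newdata, centers) {
  apply(newdata, 1, function(row) which.min(colSums((t(centers) - row)^2)))
}

test_cluster <- nearest_cluster(X_test, km$centers)

# Build the prediction as a factor with the same levels as the true labels
test_pre <- factor(cluster_to_class[test_cluster], levels = levels(y_test))

table(Predicted = test_pre, Actual = y_test)
mean(test_pre == y_test)                   # share of test rows labelled correctly
```

Because test_pre is created with levels = levels(y_test), both factors share the same levels, which is the condition that the "data and reference should be factors with the same levels" error checks for. On heavily imbalanced data such as fraud records, a majority vote may label every cluster as the common class, so the vote table is worth inspecting before relying on the result.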

validation

r

cluster-analysis

k-means

confusion-matrix