'factors with the same levels' in Confusion Matrix

I am trying to build a decision tree, but I get this error when I create the confusion matrix in the last line:

Error : `data` and `reference` should be factors with the same levels

Here is my code:

library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)

#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)

#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)

#making sure the data is in the right format 
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))

#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)

So I tried following the instructions from another thread:

confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))

But I still get an error:

Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid

Try to keep the factor levels of train and test the same as those in df.

train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))
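As a minimal base-R sketch of why this helps (the labels "Member" and "Normal" are made up for illustration, not taken from your file): converting each subset to a factor on its own can produce different level sets, while passing one shared level set keeps them aligned.

```r
# Hypothetical labels, for illustration only
all_labels   <- c("Member", "Normal", "Member", "Normal", "Member")
train_labels <- all_labels[1:3]
test_labels  <- all_labels[4]      # only "Normal" happens to land here

# Converted on its own, the test subset knows just one level
lv_separate_test <- levels(factor(test_labels))

# With an explicit, shared level set both subsets agree
shared  <- sort(unique(all_labels))
train_f <- factor(train_labels, levels = shared)
test_f  <- factor(test_labels,  levels = shared)

identical(levels(train_f), levels(test_f))  # TRUE
```

The same idea applies to the predictions: they need to carry the same levels as the reference column before they are compared.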

I made a toy dataset and checked your code. There are a few problems:

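  1. R is easier to work with when variable names follow a certain style. Your 'Customer type' variable contains a space. Coding is generally easier if you avoid spaces, so I renamed it to 'Customer_type'. For your data.frame, you can either change the name in the source file or use names(df) <- gsub("Customer type", "Customer_type", names(df))
  2. I encoded 'Customer_type' as a factor. For you, this would look like df$Customer_type <- factor(df$Customer_type)
  3. The documentation for sample.split() says that the first argument, 'Y', should be a vector of labels. In your code you passed the variable name as a string instead, so the function only sees a single value. Pass the label column itself, with one label per row: sample.split(df$Customer_type, SplitRatio = .70). To see which labels the variable can take, use levels(df$Customer_type); in my example these are High, Med and Low.
  4. Adjust the rpart() call as shown in the code below: the formula should name the column (Customer_type ~ .) rather than df$`Customer type`, so that the response is taken from train.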

With these adjustments, your code should probably work.
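To see why the original split can leave nothing to test on, here is a minimal base-R sketch. The toy data frame and the single TRUE mask are my assumptions: the mask stands in for what sample.split() gives back when it is handed one string, which I have not checked against your data.

```r
# Toy data frame standing in for df (hypothetical values)
toy <- data.frame(Customer_type = c("Member", "Normal", "Member", "Normal"),
                  Total         = c(10, 20, 30, 40))

# Assumption: a split built from a single string is a length-1 logical
mask <- TRUE

# subset() recycles the length-1 mask over all rows
train_toy <- subset(toy, mask == TRUE)   # every row
test_toy  <- subset(toy, mask == FALSE)  # no rows

nrow(train_toy)
nrow(test_toy)
```

With an empty test set, the predictions and the reference column have nothing in common to compare, which would be consistent with the errors you are seeing.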

# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
                 Quantity = sample(1:10, 100, replace = TRUE),
                 Total = sample(1:10, 100, replace = TRUE),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100),
                 Rating = factor(sample(1:5, 100, replace = TRUE)))

library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)

#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # PASS THE LABEL COLUMN ITSELF (ONE LABEL PER ROW), NOT A STRING
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
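
If the call still complains, a quick check before it can help. This is a base-R sketch with made-up vectors (pred and ref are hypothetical stand-ins for the predictions and test$Customer_type): it puts both factors on the union of their levels, so that a cross-tabulation comes out square. Judging by your second error message, a square table is what confusionMatrix() expects when it is given a table.

```r
# Hypothetical predictions and reference labels
pred <- factor(c("High", "Low", "Low"))   # never predicts "Med"
ref  <- factor(c("High", "Med", "Low"))

# Put both factors on the same level set
all_levels <- union(levels(pred), levels(ref))
pred <- factor(pred, levels = all_levels)
ref  <- factor(ref,  levels = all_levels)

identical(levels(pred), levels(ref))  # TRUE

# The cross-tabulation is now square (3 x 3), with zero counts where needed
cm <- table(Predicted = pred, Actual = ref)
dim(cm)
```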