R - confusionMatrix() - Error in sort.list(y) : 'x' must be atomic for 'sort.list'

Question

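I am trying to use train() with a random forest for the Coursera Practical Machine Learning project, but I have run into two problems. Because the original data set is large, I have reproduced the problem with two small data frames, shown below.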

Input

library(caret)
f = data.frame(x = 1:10, y = 11:20)
f2 = data.frame(x = 1:5, y = 6:10)
fit <- train(y~., data = f, method="lm")
pred <- predict(fit, newdata = f2)
confusionMatrix(pred, f2)

Output (main problem)

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

If I use the table function instead of confusionMatrix, I get the following:

Error in table(pred, data = f2) : all arguments must have the same length

Yet the length of pred is 5, and the length of f2$y is also 5.

As a side note, the train() call that creates fit in this example also occasionally gives me a warning that I don't understand either.

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
There were missing values in resampled performance measures.


Answer 1

I think you have three problems.

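First, confusionMatrix expects two vectors, but f2 is a data frame. Use confusionMatrix(pred, f2$y) instead.
But that gives a different error: The data must contain some levels that overlap the reference. This brings us to the second problem. If you look at the predicted and actual values for f2, there is no overlap. Essentially, f and f2 represent completely different relationships between x and y. You can see this by plotting.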
```
library(tidyverse)
theme_set(theme_classic())

ggplot(bind_rows(f=f,f2=f2, .id="source"), aes(x,y,colour=source)) +
  geom_point() +
  geom_smooth(method="lm") 
```
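The first two points can also be checked directly in the console. The sketch below is only an illustration: it fits the line with base R's lm() instead of caret's train(), on the assumption that both give the same fit for this noise-free data, and it uses only the f and f2 from the question.

```r
# Data from the question
f  <- data.frame(x = 1:10, y = 11:20)
f2 <- data.frame(x = 1:5,  y = 6:10)

# Assumption: plain lm() stands in for train(y ~ ., data = f, method = "lm")
fit  <- lm(y ~ x, data = f)
pred <- predict(fit, newdata = f2)

# A data frame is a list of columns, not an atomic vector
is.atomic(f2)     # FALSE
length(f2)        # 2 (the number of columns), not 5
is.atomic(f2$y)   # TRUE
length(f2$y)      # 5

# Predicted and actual values have nothing in common
range(pred)                    # about 11 to 15
range(f2$y)                    # 6 to 10
intersect(round(pred), f2$y)   # empty
```

That length(f2) is 2 rather than 5 also appears to be why table(pred, data = f2) complains that its arguments do not have the same length.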
In addition, there is no noise in the fake data, so the fit is perfect (RMSE = 0 and R-squared = 1).
```
fit
```
```
Resampling results:

  RMSE          Rsquared
  1.650006e-15  1
```
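Third, the fake data set has a continuous outcome variable. A confusion matrix, however, is a tool for checking the quality of a classification model, that is, a model whose outcome is categorical rather than continuous. In that case you would use a model suited to classification, such as logistic regression or a random forest, rather than a linear regression model. You would then use confusionMatrix to compare the predicted classes with the actual classes.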
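For a continuous outcome like the one in the question, error measures such as RMSE or MAE are the usual replacement for a confusion matrix. The following is a minimal base-R sketch under the same assumption as before (lm() standing in for train() with method = "lm"); the variable names obs, rmse, mae and rsq are chosen here for illustration only.

```r
# Data from the question
f  <- data.frame(x = 1:10, y = 11:20)
f2 <- data.frame(x = 1:5,  y = 6:10)

# Assumption: plain lm() stands in for train(y ~ ., data = f, method = "lm")
fit  <- lm(y ~ x, data = f)
pred <- predict(fit, newdata = f2)
obs  <- f2$y

# Regression metrics computed by hand
rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
rsq  <- cor(pred, obs)^2            # squared correlation

c(RMSE = rmse, MAE = mae, Rsquared = rsq)  # about 5, 5 and 1
```

Every prediction is off by 5, yet the squared correlation is still 1, which is a reminder that R-squared computed this way measures association rather than agreement. caret's postResample(pred, obs) should report the same three quantities if you prefer not to compute them by hand.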

Here is a classification example:

library(caret)

# Fake data
set.seed(100)
# The outcome is stored as a factor so that train() treats this as classification
f = data.frame(y = factor(c(rep(c("A","B"), c(100,25)), rep(c("B","A"), c(100,25)))),
               x = c(rnorm(125, 1, 1), rnorm(125, 3, 1)))

# Train model on training data
set.seed(50)
idx = sample(1:nrow(f), 200)  # Indices of training observations
fit <- train(y ~ ., data = f[idx,], method="glm")

# Get predictions on probability scale
pred <- predict(fit, newdata=f[-idx, ], type="prob")

# Create data frame for confusion matrix
# (confusionMatrix needs two factors with the same levels)
results = data.frame(pred=factor(ifelse(pred$A < 0.5, "B","A"), levels=levels(f$y)),
                     actual=f$y[-idx])

confusionMatrix(results$pred, results$actual)

Confusion Matrix and Statistics

          Reference
Prediction  A  B
         A 16  7
         B  6 21

               Accuracy : 0.74            
                 95% CI : (0.5966, 0.8537)
    No Information Rate : 0.56            
    P-Value [Acc > NIR] : 0.006698        

                  Kappa : 0.475           
 Mcnemar's Test P-Value : 1.000000        

            Sensitivity : 0.7273          
            Specificity : 0.7500          
         Pos Pred Value : 0.6957          
         Neg Pred Value : 0.7778          
             Prevalence : 0.4400          
         Detection Rate : 0.3200          
   Detection Prevalence : 0.4600          
      Balanced Accuracy : 0.7386          

       'Positive' Class : A
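
To make the statistics above less opaque, the sketch below recomputes several of them by hand from the four cell counts in the printed table. It is only an illustration in base R; the names cm, accuracy, sensitivity and so on are chosen here and are not part of caret.

```r
# Cell counts copied from the confusion matrix above
# (rows = Prediction, columns = Reference; matrix() fills column by column)
cm <- matrix(c(16, 6, 7, 21), nrow = 2,
             dimnames = list(Prediction = c("A", "B"),
                             Reference  = c("A", "B")))

# "A" is the positive class, as in the output above
accuracy    <- sum(diag(cm)) / sum(cm)        # (16 + 21) / 50
sensitivity <- cm["A", "A"] / sum(cm[, "A"])  # 16 / 22
specificity <- cm["B", "B"] / sum(cm[, "B"])  # 21 / 28
ppv         <- cm["A", "A"] / sum(cm["A", ])  # 16 / 23
npv         <- cm["B", "B"] / sum(cm["B", ])  # 21 / 27
prevalence  <- sum(cm[, "A"]) / sum(cm)       # 22 / 50

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, PPV = ppv, NPV = npv,
        Prevalence = prevalence), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value and Prevalence lines printed by confusionMatrix.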
