ROC curve for Training set and Test set for each fold of cross validation in Caret

Question

Is it possible to get separate ROC curves for the training set and the test set of each fold in a 5-fold cross-validation in Caret?

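library(caret)
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE, classProbs = TRUE)
# named rfmodel because the rest of the thread refers to the fitted model by that name
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")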

I can do the following, but I do not know whether it returns the ROC for the training set or for the test set of Fold1:

library(pROC) 
selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices],rfmodel$pred$setosa[selectedIndices])

Answer 1

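The documentation is indeed not at all clear about the contents of rfmodel$pred. I would bet that the predictions it contains are for the folds used as test sets, but I cannot point to any evidence in the docs. Regardless, you are still missing some points in the way you are trying to get the ROC.

First, let's isolate rfmodel$pred in a separate data frame for easier handling: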

dd <- rfmodel$pred

nrow(dd)
# 450

Why 450 rows? Because you have tried 3 different parameter settings (in your case, just 3 different values for mtry):

rfmodel$results
# output:
  mtry Accuracy Kappa AccuracySD    KappaSD
1    2     0.96  0.94 0.04346135 0.06519202
2    3     0.96  0.94 0.04346135 0.06519202
3    4     0.96  0.94 0.04346135 0.06519202

and 150 rows x 3 settings = 450.

Let's have a closer look at the contents of rfmodel$pred:
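The row count can also be checked directly. The following is a minimal sketch that refits the model from the question; the seed is an arbitrary choice of mine, added only so the folds are reproducible.

```r
library(caret)

set.seed(42)  # arbitrary seed, not part of the original question
train_control <- trainControl(method = "cv", number = 5,
                              savePredictions = TRUE, classProbs = TRUE)
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")
dd <- rfmodel$pred

# Number of saved predictions for each value of mtry
rows_per_mtry <- table(dd$mtry)
print(rows_per_mtry)

# Within a single mtry value, check whether every row of iris appears exactly once
one_setting <- dd[dd$mtry == dd$mtry[1], ]
covers_all_rows <- setequal(one_setting$rowIndex, seq_len(nrow(iris))) &&
  !any(duplicated(one_setting$rowIndex))
print(covers_all_rows)
```

If each mtry value accounts for 150 rows, the total of 450 follows from the number of tuning values alone.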

head(dd)

# result:
    pred    obs setosa versicolor virginica rowIndex mtry Resample
1 setosa setosa  1.000      0.000         0        2    2    Fold1
2 setosa setosa  1.000      0.000         0        3    2    Fold1
3 setosa setosa  1.000      0.000         0        6    2    Fold1
4 setosa setosa  0.998      0.002         0       24    2    Fold1
5 setosa setosa  1.000      0.000         0       33    2    Fold1
6 setosa setosa  1.000      0.000         0       38    2    Fold1

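Column obs contains the true values
The three columns setosa, versicolor, and virginica contain the respective probabilities calculated for each class, and they sum up to 1 for each row
Column pred contains the final prediction, i.e. the class with the maximum probability among the three columns mentioned above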
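These column descriptions can be sanity-checked on the saved predictions. This is a rough sketch under the same setup as the question (the seed is my own addition); the arg-max comparison is reported as a fraction rather than asserted, since tied probabilities may be resolved differently by the model.

```r
library(caret)

set.seed(42)  # arbitrary seed, not part of the original question
train_control <- trainControl(method = "cv", number = 5,
                              savePredictions = TRUE, classProbs = TRUE)
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")
dd <- rfmodel$pred

# The class probability columns are named after the levels of the outcome
class_names <- levels(dd$obs)
probs <- as.matrix(dd[, class_names])

# Each row of class probabilities should sum to 1 (up to floating point error)
max_deviation <- max(abs(rowSums(probs) - 1))
print(max_deviation)

# Class with the largest probability in each row, compared with the pred column
argmax_class <- class_names[max.col(probs, ties.method = "first")]
agreement <- mean(argmax_class == as.character(dd$pred))
print(agreement)
```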

If this were the whole story, your way of plotting the ROC would be OK, i.e.:

selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices],rfmodel$pred$setosa[selectedIndices])

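But this is not the whole story (the mere existence of 450 rows instead of 150 should already have given a hint): notice the column named mtry; indeed, rfmodel$pred includes the results for all runs of the cross-validation (i.e. for all the parameter settings):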

tail(dd)

# result:
         pred       obs setosa versicolor virginica rowIndex mtry Resample
445 virginica virginica      0      0.004     0.996      112    4    Fold5
446 virginica virginica      0      0.000     1.000      113    4    Fold5
447 virginica virginica      0      0.020     0.980      115    4    Fold5
448 virginica virginica      0      0.000     1.000      118    4    Fold5
449 virginica virginica      0      0.394     0.606      135    4    Fold5
450 virginica virginica      0      0.000     1.000      140    4    Fold5

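This is the ultimate reason why your selectedIndices calculation is not correct; it should also include a specific choice of mtry, otherwise the ROC does not make any sense, since it "aggregates" more than one model: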

selectedIndices <- rfmodel$pred$Resample == "Fold1" & rfmodel$pred$mtry == 2
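Putting the corrected selection together with the plotting call, a minimal sketch could look like the following. Since a ROC curve is defined for a binary outcome, this version recodes the observations as setosa versus the rest; that one-vs-rest recoding, the seed, and the plot title are my own choices, not something stated in the question.

```r
library(caret)
library(pROC)

set.seed(42)  # arbitrary seed, not part of the original question
train_control <- trainControl(method = "cv", number = 5,
                              savePredictions = TRUE, classProbs = TRUE)
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")

# Keep only the Fold1 predictions that belong to one tuning value
sel <- rfmodel$pred$Resample == "Fold1" & rfmodel$pred$mtry == 2
fold1 <- rfmodel$pred[sel, ]

# One-vs-rest coding: 1 for setosa, 0 for the other two species
is_setosa <- as.integer(fold1$obs == "setosa")

# Higher predicted probability of setosa should indicate the positive class
roc_fold1 <- roc(response = is_setosa, predictor = fold1$setosa,
                 levels = c(0, 1), direction = "<")
plot(roc_fold1, main = "Fold1, mtry = 2, setosa vs. rest")
print(auc(roc_fold1))
```

The same pattern applies to the other folds and to the versicolor and virginica columns by changing the fold name and the class used in the recoding.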

--

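As I said in the beginning, I bet that the predictions in rfmodel$pred are for the folds used as test sets; indeed, if we compute the accuracies manually, they coincide with the ones reported in rfmodel$results shown above (0.96 for all 3 settings), which we know are for the folds used as test sets (arguably, the respective training accuracies are 1.0):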

for (i in 2:4) {  # mtry values in {2, 3, 4}
  acc = (length(which(dd$pred == dd$obs & dd$mtry==i & dd$Resample=='Fold1'))/30 +
         length(which(dd$pred == dd$obs & dd$mtry==i & dd$Resample=='Fold2'))/30 +
         length(which(dd$pred == dd$obs & dd$mtry==i & dd$Resample=='Fold3'))/30 +
         length(which(dd$pred == dd$obs & dd$mtry==i & dd$Resample=='Fold4'))/30 +
         length(which(dd$pred == dd$obs & dd$mtry==i & dd$Resample=='Fold5'))/30
        )/5
  print(acc)
}

# result:
[1] 0.96
[1] 0.96
[1] 0.96
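A more direct check of the held-out claim, and a possible route to a training-set ROC, is sketched below. It assumes that caret keeps the training row indices of each fold in rfmodel$control$index, which is my understanding of the train object rather than something confirmed in this thread. Because the saved predictions do not appear to cover the training rows, the training-set curve is obtained here by refitting a random forest on the Fold1 training rows with randomForest directly; this is a stand-in for the model caret fitted internally, not the same object.

```r
library(caret)
library(randomForest)
library(pROC)

set.seed(42)  # arbitrary seed, not part of the original question
train_control <- trainControl(method = "cv", number = 5,
                              savePredictions = TRUE, classProbs = TRUE)
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")
dd <- rfmodel$pred

# Rows used to fit the Fold1 models (assumed location: rfmodel$control$index)
train_rows <- rfmodel$control$index$Fold1
# Rows for which Fold1 predictions were saved
test_rows <- unique(dd$rowIndex[dd$Resample == "Fold1"])

# If the saved predictions are for held-out rows, the two sets do not overlap
overlap <- intersect(train_rows, test_rows)
print(length(overlap))

# Training-set ROC for Fold1 (setosa vs. rest), using a refit on the training rows
fit <- randomForest(Species ~ ., data = iris[train_rows, ], mtry = 2)
train_prob <- predict(fit, newdata = iris[train_rows, ], type = "prob")[, "setosa"]
is_setosa <- as.integer(iris$Species[train_rows] == "setosa")
roc_train <- roc(response = is_setosa, predictor = train_prob,
                 levels = c(0, 1), direction = "<")
plot(roc_train, main = "Fold1 training rows, mtry = 2, setosa vs. rest")
```

Passing newdata explicitly matters in this sketch: calling predict on a randomForest object without newdata returns out-of-bag predictions, which are not training-set predictions in the usual sense.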
