How to input a caret trained random forest model into predict() and performance() functions?

I want to create a precision-recall curve with performance(), but I don't know how to pass in my data. I followed this example.

library(ROCR)
data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred,"prec","rec")
plot(perf)

I am trying to do the same for my caret-trained RF model, specifically on the training data (I know there are various examples of how to use predict on newdata). I tried this:

pred <- prediction(rf_train_model$pred$case, rf_train_model$pred$pred)
perf <- performance(pred,"prec","rec")
plot(perf)

My model is below. I tried the approach above because it seemed to match the structure of the ROCR.simple data.

#create model
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
set.seed(3949)
rf_train_model <- train(outcome ~ ., data=df_train, 
                  method= "rf",
                  ntree = 1500, 
                  tuneGrid = data.frame(mtry = 33), 
                  trControl = ctrl, 
                  preProc=c("center","scale"), 
                  metric="ROC",
                  importance=TRUE)

> head(rf_train_model$pred)
     pred     obs      case   control rowIndex mtry Resample
1 control control 0.3173333 0.6826667        4   33    Fold1
2 control control 0.3666667 0.6333333        7   33    Fold1
3 control control 0.2653333 0.7346667       16   33    Fold1
4 control control 0.1606667 0.8393333       18   33    Fold1
5 control control 0.2840000 0.7160000       20   33    Fold1
6    case    case 0.6206667 0.3793333       25   33    Fold1

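This is wrong, because my precision-recall curve goes in the wrong direction. I am interested in more than just the PRAUC curve, although this is a good source showing how to make one, so I would like to fix this error. What mistake did I make?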

If you read the performance vignette:

it has to be declared which class label denotes the negative, and which the positive class. Ideally, labels should be supplied as ordered factor(s), the lower level corresponding to the negative class, the upper level to the positive class. If the labels are factors (unordered), numeric, logical or characters, ordering of the labels is inferred from R's built-in < relation (e.g. 0 < 1, -1 < 1, 'a' < 'b', FALSE < TRUE).
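To see what that inference means for labels like the ones in this question, here is a minimal base R sketch. The toy vector `labels` is made up for illustration and is not the asker's data.

```r
# Toy labels, invented for illustration; not the asker's data.
labels <- c("control", "case", "control", "case")

# factor() sorts levels alphabetically by default, so "case" comes first.
lv <- levels(factor(labels))
print(lv)

# The same ordering follows from R's built-in < relation on strings.
print("case" < "control")

# Under the rule quoted above, the lower level is treated as the negative
# class and the upper level as the positive class.
negative_class <- lv[1]
positive_class <- lv[2]
print(c(negative = negative_class, positive = positive_class))
```

So with character or unordered factor labels "case"/"control", the quoted rule implies that "control" is taken as the positive class unless the ordering is stated explicitly.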

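In your case, when you supply rf_train_model$pred$pred, the upper level is still "control", so the best approach is to turn the labels into TRUE/FALSE. You should also supply the actual (observed) labels rather than the predicted labels, i.e. rf_train_model$pred$obs. See the example below: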

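library(caret)
library(ROCR)
set.seed(100)
# Simulated data: 100 rows, 100 random predictors, random two-class outcome
df = data.frame(matrix(runif(100*100),ncol=100))
df$outcome = factor(ifelse(runif(100)>0.5,"case","control"))

df_train = df[1:80,]
df_test = df[81:100,]

# Same resampling setup as in the question
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

rf_train_model <- train(outcome ~ ., data=df_train, 
                  method= "rf",
                  ntree = 1500, 
                  tuneGrid = data.frame(mtry = 33), 
                  trControl = ctrl, 
                  preProc=c("center","scale"), 
                  metric="ROC",
                  importance=TRUE)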

levels(rf_train_model$pred$pred)
[1] "case"    "control"

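# label: observed classes; positive_class: level to treat as positive;
# prob: predicted probability of the positive class
plotCurve = function(label,positive_class,prob){
  pred = prediction(prob,label==positive_class)
  perf <- performance(pred,"prec","rec")
  plot(perf)
}

# Cross-validated (held-out fold) predictions on the training data
plotCurve(rf_train_model$pred$obs,"case",rf_train_model$pred$case)
# Predictions on the test set; select the "case" probability column by name
plotCurve(df_test$outcome,"case",predict(rf_train_model,df_test,type="prob")[,"case"])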
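
As an alternative to converting the labels to TRUE/FALSE, prediction() also accepts a label.ordering argument that names the negative class first and the positive class second. The sketch below uses made-up scores; the vectors `scores` and `obs` are invented for illustration and are not taken from the model above.

```r
library(ROCR)

# Made-up data for illustration: a higher score means more likely "case".
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
obs    <- factor(c("case", "case", "control", "case", "control", "control"))

# label.ordering = c(negative, positive) overrides the alphabetical default.
pred_lo <- prediction(scores, obs, label.ordering = c("control", "case"))

# Equivalent formulation with logical labels (TRUE = positive class).
pred_tf <- prediction(scores, obs == "case")

perf_lo <- performance(pred_lo, "prec", "rec")
perf_tf <- performance(pred_tf, "prec", "rec")

# Compare the recall (x) and precision (y) values of the two formulations.
same_curve <- isTRUE(all.equal(perf_lo@y.values[[1]], perf_tf@y.values[[1]])) &&
  isTRUE(all.equal(perf_lo@x.values[[1]], perf_tf@x.values[[1]]))
print(same_curve)

plot(perf_lo)
```

Either form should work with the caret output; label.ordering keeps the original class names in the prediction object, which may be easier to read when inspecting it later.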