r caretEnsemble warning: indexes not defined in trControl

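I have some R/caret code that fits several cross-validated models to some data, but I'm getting a warning message that I can't find any information on. Is this something I should be concerned about?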

library(datasets)
library(caret)
library(caretEnsemble)

# load data
data("iris")

# establish cross-validation structure
set.seed(32)
trainControl <- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions=TRUE, search="random")

# fit several (cross-validated) models 
algorithmList <- c('lda',         # Linear Discriminant Analysis 
                   'rpart' ,      # Classification and Regression Trees
                   'svmRadial')   # SVM with RBF Kernel

models <- caretList(Species~., data=iris, trControl=trainControl, methodList=algorithmList)

Log output:

Warning messages:
1: In trControlCheck(x = trControl, y = target) :
  x$savePredictions == TRUE is depreciated. Setting to 'final' instead.
2: In trControlCheck(x = trControl, y = target) :
  indexes not defined in trControl.  Attempting to set them ourselves, so each model in the ensemble will have the same resampling indexes.

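...I thought that my trainControl object, which defines a cross-validation structure (3x 5-fold cross-validation), would generate a set of indexes for the CV splits. So I'm confused about why I'm getting this message.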

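trainControl does not generate the indexes for you by default. It only acts as a way of passing the same set of parameters to each model you are training.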
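This can be checked by inspecting the object directly. A minimal sketch, assuming caret is installed (ctrl is an illustrative name, not part of the question's code):

```r
library(caret)

# Same settings as in the question, but with no explicit index argument
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     savePredictions = "final", search = "random")

# The object only stores the settings; no fold indexes exist yet
print(is.null(ctrl$index))
print(ctrl$number)
print(ctrl$repeats)
```

The resampling settings are recorded, but the index slot stays empty until you supply it yourself.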

Searching the GitHub issues for this warning turns up this particular issue, which says:

You need to make sure that every model is fit with the EXACT same resampling folds. caretEnsemble builds the ensemble by merging together the test sets for each cross-validation fold, and you will get incorrect results if each fold has different observations in it.

Before you fit your models, you need to construct a trainControl object, and manually set the indexes in that object.

E.g. myControl <- trainControl(index=createFolds(y, 10)).

We are working on an interface to caretEnsemble that handles constructing the resampling strategy for you and then fitting multiple models using those resamples, but it is not yet finished.

To reiterate, that check is there for a reason. You need to set the index argument in trainControl, and pass the EXACT SAME indexes to each model you wish to ensemble.

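So this means that when you specify number = 5 and repeats = 3, the models are not actually handed a predetermined set of indexes saying which samples belong to each fold. Left to themselves, they each generate their own indexes independently. (As the warning text itself says, caretList notices this and attempts to set shared indexes for you.)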

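So, to make sure the models agree on which samples belong to which folds, specify the index argument in your trainControl object yourself. Note that index expects the rows used for training in each resample, while createFolds returns the held-out rows by default, so for 3x repeated 5-fold CV, createMultiFolds(iris$Species, k = 5, times = 3) is the safer choice:

# new trainControl object with index specified
set.seed(32)
trainControl <- trainControl(method = "repeatedcv",
                             number = 5,
                             repeats = 3,
                             # training-row indexes for all 5 x 3 resamples
                             index = createMultiFolds(iris$Species, k = 5, times = 3),
                             savePredictions = "all",
                             search = "random")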
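
Before passing a fold list as index, it can be worth checking whether its elements are training rows (what index expects) or held-out rows. A minimal sketch, assuming caret is installed; the variable names are illustrative only:

```r
library(caret)

data("iris")
set.seed(32)

# Default createFolds(): each element holds the HELD-OUT rows of one fold
heldOut <- createFolds(iris$Species, k = 5)

# returnTrain = TRUE: each element holds the TRAINING rows of one fold
trainRows <- createFolds(iris$Species, k = 5, returnTrain = TRUE)

# createMultiFolds(): training rows for repeated k-fold CV (5 folds x 3 repeats)
multiTrainRows <- createMultiFolds(iris$Species, k = 5, times = 3)

# Compare the number of rows per element and the number of resamples
print(lengths(heldOut))
print(lengths(trainRows))
print(length(multiTrainRows))
```

If the elements are much smaller than the data set, the list contains held-out rows, and using it as index would train each resample on only a small fraction of the data.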