Properties of Random Forest in 5-fold cross-validation using Caret

Question

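Consider running 5-fold cross-validation in caret with the random forest method. What are the properties of the random forest built in each fold? For example, on the iris dataset: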

library(caret)
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE)
output <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")
output$results$mtry
[1] 2 3 4

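There are 3 mtry values. Does that mean 3 different forests were built during cross-validation? And how can I find out the details of each fold's forest, such as its mtry?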

Answer 1

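By default, caret's train function performs a grid search to find the best mtry. If no tuning length is supplied, it searches over 3 candidate values.

These defaults can be seen in: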

?trainControl
?train

tuneLength = ifelse(trControl$method == "none", 1, 3)
search = "grid"
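If the default grid is not what you want, the candidates can be set explicitly. The following is a minimal sketch, assuming the caret and randomForest packages are installed; the seed and the choice of mtry values 2 and 4 are arbitrary and only for illustration.

```r
library(caret)

set.seed(1)  # arbitrary seed, only so the fold assignment is reproducible

ctrl <- trainControl(method = "cv", number = 5)

# Supplying tuneGrid replaces the default 3-value grid for mtry
custom_fit <- train(Species ~ ., data = iris,
                    method = "rf",
                    trControl = ctrl,
                    tuneGrid = expand.grid(mtry = c(2, 4)))

# Only the two requested candidates appear in the results
custom_fit$results$mtry
```

Passing tuneLength instead of tuneGrid keeps caret's own sequence but changes how many candidates it generates.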

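When grid search is used (the default) with a length of 3 (also the default), the mtry candidates are generated by the caret function var_seq. This can be seen in the source of the rf train method. The function behaves differently depending on the number of features. With fewer than 500 features, it chooses mtry as: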

floor(seq(2, to = p, length = len))

where p is the number of features. The iris data has 4 features, so with len equal to 3 the candidate mtry values are 2, 3, and 4.
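The arithmetic can be checked in base R without fitting anything. This is a small sketch of the formula quoted above for the case of fewer than 500 features; the variable names are only illustrative.

```r
p   <- ncol(iris) - 1  # 4 predictors (Species is the outcome)
len <- 3               # default tuneLength

# Evenly spaced values from 2 to p, rounded down
mtry_candidates <- floor(seq(2, to = p, length.out = len))
mtry_candidates
```

With 4 predictors the spacing is exactly 1, so the three candidates come out as 2, 3, and 4, matching output$results$mtry in the question.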

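So all three mtry values are tested in the 5-fold CV. That means 15 random forest models are fit in total, 5 for each mtry value. Finally, the best mtry is chosen based on the CV results and a final model is built on the entire training data, which makes it the 16th model.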
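As far as I know, caret does not keep the 15 resampled forests themselves, but it can keep their performance and held-out predictions. The sketch below assumes the caret and randomForest packages are installed; the seed is arbitrary, and returnResamp = "all" is added to the question's trainControl call so that every fold and mtry combination is reported.

```r
library(caret)

set.seed(42)  # arbitrary seed, only for reproducible folds

train_control <- trainControl(method = "cv", number = 5,
                              savePredictions = TRUE,
                              returnResamp = "all")
output <- train(Species ~ ., data = iris,
                trControl = train_control, method = "rf")

# One row per (mtry, fold) pair: 3 candidates x 5 folds = 15 fits
nrow(output$resample)
table(output$resample$mtry)

# Held-out predictions, labelled with the fold and the mtry used
head(output$pred)

# The selected mtry and the 16th model, refit on all of iris
output$bestTune
output$finalModel
```

The Resample and mtry columns in output$resample and output$pred identify which fold and which candidate each row came from; output$finalModel is an ordinary randomForest object.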

