Building a RandomForest with caret
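I am trying to build a random forest model in caret by following the steps here. Essentially, they set up the RandomForest, then find the best mtry, then the best maxnodes, then the best number of trees. These steps make sense, but wouldn't it be better to search over the interaction of these three factors rather than one factor at a time?
Secondly, I understand how to perform a grid search over mtry and ntrees, but I don't know what to set the minimum or maximum number of nodes to. Is it generally recommended to keep the default nodesize, as shown below?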
library(randomForest)
library(caret)

# Candidate values for the two parameters
mtrys <- seq(1, 4, 1)
ntrees <- c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)

# One row per (mtry, ntree) combination; the accuracy column is filled in below
combo_mtrTrees <- data.frame(expand.grid(mtrys, ntrees))
colnames(combo_mtrTrees) <- c('mtrys', 'ntrees')

tuneGrid <- expand.grid(.mtry = c(1:4))

for (i in 1:length(ntrees)) {
  ntree <- ntrees[i]
  set.seed(65)
  # caret tunes mtry by 5-fold CV; ntree is passed through to randomForest
  rf_maxtrees <- train(Species ~ .,
                       data = df,
                       method = "rf",
                       importance = TRUE,
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trainControl(method = "cv",
                                                number = 5,
                                                search = 'grid',
                                                classProbs = TRUE,
                                                savePredictions = "final"),
                       ntree = ntree)
  # Cross-validated accuracy for each mtry at this ntree
  Acc1 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 1]
  Acc2 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 2]
  Acc3 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 3]
  Acc4 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 4]
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 1 & combo_mtrTrees$ntrees == ntree] <- Acc1
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 2 & combo_mtrTrees$ntrees == ntree] <- Acc2
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 3 & combo_mtrTrees$ntrees == ntree] <- Acc3
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 4 & combo_mtrTrees$ntrees == ntree] <- Acc4
}
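Yes, it is better to search over the interaction of the parameters.
nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally I would leave maxnodes at its default and perhaps tune nodesize, which can be viewed as a regularization parameter. To get an idea of which values to try, check the defaults in rf, which are 1 for classification and 5 for regression. So trying 1-10 is an option.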
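To make the regularization remark concrete, here is a minimal sketch that fits the same forest with two nodesize values and compares the average number of terminal nodes per tree. The values ntree = 200 and nodesize = 10 are arbitrary choices for illustration, not recommendations.

```r
library(randomForest)
library(mlbench)

data(Sonar)

# Fit the same forest with two terminal node sizes and compare tree sizes.
# ntree = 200 and nodesize = 10 are illustration values only.
mean_leaves <- sapply(c(1, 10), function(ns) {
  set.seed(1)
  fit <- randomForest(Class ~ ., data = Sonar, ntree = 200, nodesize = ns)
  mean(treesize(fit, terminal = TRUE))  # average number of terminal nodes per tree
})
names(mean_leaves) <- c("nodesize = 1", "nodesize = 10")
mean_leaves
```

A larger nodesize stops splitting earlier, so the trees should come out smaller, which is the sense in which it acts like a regularizer.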
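When tuning in a loop as in your example, it is advisable to always use the same cross-validation folds. You can create them with createFolds before calling the loop.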
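As a quick check of what createFolds returns with returnTrain = TRUE, the sketch below inspects the folds for the Sonar data. The held_out helper list is introduced here only for illustration.

```r
library(caret)
library(mlbench)

data(Sonar)

set.seed(1234)
# Each list element holds the row indices used for TRAINING in that fold
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

# The held-out rows of a fold are the complement of its training indices
held_out <- lapply(cv_folds, function(idx) setdiff(seq_len(nrow(Sonar)), idx))

lengths(held_out)                # size of each held-out part
table(table(unlist(held_out)))   # how many times each row is held out
```

Because the indices are fixed once, passing the same list to every train call through the index argument of trainControl means all models are compared on identical splits.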
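After tuning, be sure to evaluate your results on an independent validation set, or perform nested cross validation, where the inner loop is used to tune the parameters and the outer loop is used to estimate model performance. Results from cross validation alone will be optimistically biased.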
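A rough sketch of nested cross validation with caret could look like the following. The fold counts, the mtry grid and ntree = 200 are assumptions chosen to keep the run short, and the names outer_folds, inner_ctrl and outer_auc are made up for this illustration.

```r
library(caret)
library(mlbench)

data(Sonar)

set.seed(1234)
# Outer folds: training indices for each outer split
outer_folds <- createFolds(Sonar$Class, k = 3, returnTrain = TRUE)

# Inner resampling used only for tuning
inner_ctrl <- trainControl(method = "cv",
                           number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

outer_auc <- sapply(outer_folds, function(train_idx) {
  train_set <- Sonar[train_idx, ]
  test_set  <- Sonar[-train_idx, ]

  # Inner loop: tune mtry using only the outer training part
  set.seed(65)
  fit <- train(Class ~ ., data = train_set,
               method = "rf",
               metric = "ROC",
               tuneGrid = expand.grid(mtry = c(1, 5, 10)),
               trControl = inner_ctrl,
               ntree = 200)

  # Outer loop: score the tuned model on rows it has never seen
  probs <- predict(fit, newdata = test_set, type = "prob")
  scored <- data.frame(obs = test_set$Class,
                       pred = predict(fit, newdata = test_set),
                       probs)
  twoClassSummary(scored, lev = levels(test_set$Class))["ROC"]
})

mean(outer_auc)  # performance estimate that is not reused for tuning
```

The outer estimate is usually a little lower than the best inner cross-validation score, which is the optimistic bias mentioned above.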
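In most cases accuracy is not an appropriate metric for choosing the best classification model, especially when the data set is imbalanced. Read about the area under the receiver operating characteristic curve (AUC), Cohen's kappa, the Matthews correlation coefficient, balanced accuracy, the F1 score, and classification threshold tuning.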
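To see why accuracy can mislead, consider a small made-up example: 8 negatives, 2 positives, and a classifier that always predicts the majority class. The labels below are hypothetical; confusionMatrix from caret reports several of the metrics listed above.

```r
library(caret)

# Hypothetical imbalanced labels: 8 negatives, 2 positives
obs  <- factor(c(rep("neg", 8), rep("pos", 2)), levels = c("pos", "neg"))
# A classifier that predicts "neg" for every case
pred <- factor(rep("neg", 10), levels = c("pos", "neg"))

cm <- confusionMatrix(data = pred, reference = obs, positive = "pos")

cm$overall[c("Accuracy", "Kappa")]
cm$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
```

Accuracy comes out at 0.8 even though no positive case is found, while kappa is 0 and balanced accuracy is 0.5, both of which flag the model as no better than chance.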
Here is an example of how to tune rf parameters jointly. I will use the Sonar data set from the mlbench package.
Create predefined folds:
library(caret)
library(mlbench)
data(Sonar)
set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)
Create the tune control:
tuneGrid <- expand.grid(.mtry = c(1:10))
ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds,
                     summaryFunction = twoClassSummary) # in most cases a better summary for two-class problems
Define the other parameters to tune. I will use just a few combinations to limit the training time of the example:
ntrees <- c(500, 1000)
nodesize <- c(1, 5)
params <- expand.grid(ntrees = ntrees,
nodesize = nodesize)
Train:
store_maxnode <- vector("list", nrow(params))
for (i in seq_len(nrow(params))) {
  nodesize <- params[i, 2]
  ntree <- params[i, 1]
  set.seed(65)
  # mtry is tuned by caret; ntree and nodesize are passed through to randomForest
  rf_model <- train(Class ~ .,
                    data = Sonar,
                    method = "rf",
                    importance = TRUE,
                    metric = "ROC",
                    tuneGrid = tuneGrid,
                    trControl = ctrl,
                    ntree = ntree,
                    nodesize = nodesize)
  store_maxnode[[i]] <- rf_model
}
################### 26.02.2021.
To avoid generic model names (model1, model2, ...), we can name the resulting list with the corresponding parameters:
names(store_maxnode) <- paste("ntrees:", params$ntrees,
"nodesize:", params$nodesize)
################### 26.02.2021.
Combine the results:
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
Output:
Call:
summary.resamples(object = results_mtry)
Models: ntrees: 500 nodesize: 1, ntrees: 1000 nodesize: 1, ntrees: 500 nodesize: 5, ntrees: 1000 nodesize: 5
Number of resamples: 5
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ntrees: 500 nodesize: 1 0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273 0
ntrees: 1000 nodesize: 1 0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182 0
ntrees: 500 nodesize: 5 0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545 0
ntrees: 1000 nodesize: 5 0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ntrees: 500 nodesize: 1 0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000 0
ntrees: 1000 nodesize: 1 0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455 0
ntrees: 500 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217 0
ntrees: 1000 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ntrees: 500 nodesize: 1 0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000 0
ntrees: 1000 nodesize: 1 0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000 0
ntrees: 500 nodesize: 5 0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053 0
ntrees: 1000 nodesize: 5 0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000 0
Get the best mtry for each model:
lapply(store_maxnode, function(x) x$bestTune)
#output
$`ntrees: 500 nodesize: 1`
mtry
1 1
$`ntrees: 1000 nodesize: 1`
mtry
2 2
$`ntrees: 500 nodesize: 5`
mtry
1 1
$`ntrees: 1000 nodesize: 5`
mtry
1 1
################### 26.02.2021.
Or get the best average performance for each model:
lapply(store_maxnode, function(x) x$results[x$results$ROC == max(x$results$ROC),])
#output
$`ntrees: 500 nodesize: 1`
mtry ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.9465758 0.9549407 0.7421053 0.02541895 0.03215337 0.0802308
$`ntrees: 1000 nodesize: 1`
mtry ROC Sens Spec ROCSD SensSD SpecSD
2 2 0.9474828 0.9371542 0.7631579 0.03728797 0.02385499 0.1209382
$`ntrees: 500 nodesize: 5`
mtry ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.9481652 0.9458498 0.7331579 0.02133659 0.02056666 0.1177407
$`ntrees: 1000 nodesize: 5`
mtry ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.9462321 0.9458498 0.7321053 0.03091747 0.02056666 0.0961229
From this toy example you can see that the highest average (over 5 folds) area under the ROC curve (ROC) is achieved with ntrees: 500, nodesize: 5 and mtry: 1, and it is equal to 0.948.
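If you prefer a single table over a list of per-model summaries, one option is to bind the parameters to caret's per-mtry results and sort by mean ROC. This is a self-contained sketch with a reduced grid; the grid values and the all_results name are assumptions made only to keep the example short.

```r
library(caret)
library(mlbench)

data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", number = 5, index = cv_folds,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Small illustrative grid (values chosen only to keep the run short)
params <- expand.grid(ntree = c(100, 200), nodesize = c(1, 5))

all_results <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  set.seed(65)
  fit <- train(Class ~ ., data = Sonar, method = "rf", metric = "ROC",
               tuneGrid = expand.grid(mtry = c(1, 5, 10)),
               trControl = ctrl,
               ntree = params$ntree[i], nodesize = params$nodesize[i])
  # Attach the loop parameters to caret's per-mtry resampling results
  data.frame(ntree = params$ntree[i],
             nodesize = params$nodesize[i],
             fit$results)
}))

# One row per (ntree, nodesize, mtry) combination, best mean ROC first
all_results <- all_results[order(all_results$ROC, decreasing = TRUE), ]
head(all_results, 3)
```

The ROCSD column in the same table is worth a look as well, since differences between the top rows are often smaller than the fold-to-fold spread.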
###################
Or you can use the default summary
ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds)
and define metric = "Kappa" in train.
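Put together, a minimal sketch of the Kappa variant might look like this. The reduced mtry grid and ntree = 200 are assumptions to keep it quick, and rf_kappa is just an illustrative name.

```r
library(caret)
library(mlbench)

data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

# Default summary (Accuracy and Kappa), so no summaryFunction is supplied
ctrl <- trainControl(method = "cv", number = 5, index = cv_folds)

set.seed(65)
rf_kappa <- train(Class ~ ., data = Sonar,
                  method = "rf",
                  metric = "Kappa",   # select mtry by Cohen's kappa
                  tuneGrid = expand.grid(mtry = c(1, 5, 10)),
                  trControl = ctrl,
                  ntree = 200)

rf_kappa$results[, c("mtry", "Accuracy", "Kappa")]
rf_kappa$bestTune
```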