Hyperparameters not changing results from random forest regression trees

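I am trying to tune the hyperparameters of a random forest regression model, but every accuracy measure comes out exactly the same no matter how the hyperparameters change. I tested the same code on the "diamonds" dataset and was able to reproduce the problem. Here is my code: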

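library(caret)    # train(), trainControl()
library(ggplot2)  # diamonds data set

# Keep carat (outcome) plus depth, x, y, z (predictors)
train = diamonds[,c(1, 5, 8:10)]
x = c(1:6)
# One random fold label (1 to 6) per row
folds = sample(x,size = nrow(diamonds), replace = T)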

rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method="ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method="cv",
                                        index=folds, 
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)

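Here are the results I get:

What is going on?

Update: I reduced the sample size to 1000 to speed up processing. The results are different from before, but they are still identical to one another. Code: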

train = diamonds[,c(1, 5, 8:10)]
train = train[c(1:1000),]
x = c(1:6)
folds = sample(x,size = nrow(train), replace = T)

rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method="ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method="cv",
                                        index=folds, 
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)

Results:

This looks like a problem with your cross-validation folds. When I run your code and print model, the output says:

Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...

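That means each resample is trained on a single row. The index argument of trainControl expects a list with one element per resample, each holding the row numbers used for training, not a vector of fold labels. With only one training row per resample, mtry has nothing to vary over, which would explain why every metric is identical.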
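
Here is a small base-R sketch of what that summary line implies. It assumes that a plain vector passed as index is walked element by element, which is consistent with the reported sizes. It only imitates that behavior and is not caret's internal code; `fold_labels` and `as_resamples` are illustrative names.

```r
set.seed(105)
n <- 1000  # assumed row count, matching the reduced sample in the question

# The question's `folds`: one fold label (1..6) per row, as a plain vector.
fold_labels <- sample(1:6, size = n, replace = TRUE)

# Walking the vector element by element gives n "resamples",
# each holding a single number that is read as one training row.
as_resamples <- as.list(fold_labels)

length(as_resamples)           # n resamples instead of 6
unique(lengths(as_resamples))  # every "training set" has a single element
range(fold_labels)             # and those elements only point at rows 1..6
```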

I think it will behave more like you expect if you define folds this way:

folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
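
If you would rather keep your own random fold labels, a list of the same shape can be built by hand. This is a minimal base-R sketch: `fold_labels` and `train_index` are illustrative names, and unlike createFolds it does not balance the folds on the outcome.

```r
set.seed(105)
n <- 1000  # assumed row count, matching the reduced sample in the question

# One random fold label (1..6) per row, as in the question.
fold_labels <- sample(1:6, size = n, replace = TRUE)

# For each fold k, the training rows are all rows NOT labelled k.
train_index <- lapply(1:6, function(k) which(fold_labels != k))
names(train_index) <- paste0("Fold", 1:6)

lengths(train_index)  # six training sets, each roughly 5/6 of n
```

A list like this can be passed as index = train_index in trainControl().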

With folds defined by createFolds, the output looks like this:

Random Forest 

1000 samples
   4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 832, 833, 835, 834, 834, 832, ... 
Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE       
  2     0.01582362  0.9933839  0.00985451
  3     0.01601980  0.9932625  0.00994588
  4     0.01567161  0.9935624  0.01018242

Tuning parameter 'splitrule' was held constant at a value
 of variance
Tuning parameter 'min.node.size' was held constant
 at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 4, splitrule
 = variance and min.node.size = 20.
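
For reference, a complete corrected run might look like the sketch below. It assumes caret, ranger and ggplot2 are installed. The data frame is renamed to `train_df` (my name, not the question's) so it does not share a name with caret's train() function, and set.seed() is moved ahead of the fold creation so the folds are reproducible. search = "random" and tuneLength are left out because, as far as I know, they are ignored once tuneGrid is supplied.

```r
library(caret)    # train(), trainControl(), createFolds()
library(ggplot2)  # diamonds data set

# First 1000 rows: carat (outcome) plus depth, x, y, z (predictors).
train_df <- as.data.frame(diamonds[1:1000, c(1, 5, 8:10)])

set.seed(105)
# List of six training-row index vectors, one per fold.
folds <- createFolds(train_df$carat, k = 6, returnTrain = TRUE)
lengths(folds)  # each training set should hold roughly 5/6 of the rows

rf_grid <- expand.grid(mtry = 2:4,
                       splitrule = "variance",
                       min.node.size = 20)

model <- train(x = train_df[, 2:5],
               y = train_df$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv", index = folds),
               num.trees = 10)

# One row per mtry value; the metrics should now differ between rows.
model$results[, c("mtry", "RMSE", "Rsquared", "MAE")]
```

The exact numbers will differ from the table above, since the seed placement and fold assignment here are not the same as in the answer.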