Hyperparameters not changing results from random forest regression trees
I am trying to tune the hyperparameters of a random forest regression model, and all of the accuracy measures come out exactly the same no matter how the hyperparameters change. I have tested the same code on the "diamonds" dataset and was able to reproduce the problem. Here is my code:
library(caret)      # train(), trainControl()
library(ggplot2)    # diamonds dataset

train = diamonds[, c(1, 5, 8:10)]
x = c(1:6)
folds = sample(x, size = nrow(diamonds), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)
Here are the results I get:
What is going on here?
Update:
I reduced the sample size to 1000 to allow faster processing and got different results, but they are still identical to one another. Code:
train = diamonds[, c(1, 5, 8:10)]
train = train[c(1:1000), ]
x = c(1:6)
folds = sample(x, size = nrow(train), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)
Results:
This looks like a problem with your cross-validation folds. When I run your code and look at the results for model, it says:

Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...

which means each resample has a training-set size of only 1. The index argument of trainControl expects a list with one element per resample, where each element holds the row indices used for training in that resample; passing a plain vector of fold labels (1 through 6) instead appears to make caret treat each entry as a one-row training set, which would explain why the metrics do not change with the hyperparameters.
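As an aside, a hand-rolled version of what index expects can be built from the original fold-label vector; this is only an illustrative sketch, and the createFolds call below does the same thing more conveniently:

# One list element per fold; each element holds the row indices used for
# TRAINING in that resample, i.e. every row not assigned to that fold.
index_list <- lapply(1:6, function(k) which(folds != k))
# trainControl(method = "cv", index = index_list) would then give 6 proper folds.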
I think it will work more like you expect if you define folds like this:

folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
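As a quick sanity check, each of the 6 list elements returned by createFolds should hold roughly 5/6 of the 1000 rows (about 833 training indices), matching the sample sizes printed below:

lengths(folds)   # expect six values of roughly 833 each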
The results then look like this:
Random Forest

1000 samples
   4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 832, 833, 835, 834, 834, 832, ...
Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE
  2     0.01582362  0.9933839  0.00985451
  3     0.01601980  0.9932625  0.00994588
  4     0.01567161  0.9935624  0.01018242

Tuning parameter 'splitrule' was held constant at a value of variance
Tuning parameter 'min.node.size' was held constant at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 4, splitrule = variance and min.node.size = 20.
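Putting it together, a minimal sketch of the corrected run (assuming library(caret) and library(ggplot2) are loaded for train() and the diamonds data; the hyperparameter grid and num.trees are taken from the question, and the output file names are omitted as they are incidental):

library(caret)
library(ggplot2)                       # diamonds dataset

train <- diamonds[, c(1, 5, 8:10)]
train <- train[1:1000, ]               # smaller sample, as in the update

set.seed(105)
# List of 6 training-index vectors, which is what trainControl's index expects
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)

rf_grid <- expand.grid(.mtry = 2:4,
                       .splitrule = "variance",
                       .min.node.size = 20)

model <- train(train[, 2:5],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds),
               num.trees = 10)

model$results                          # RMSE should now differ across mtry values

Since tuneGrid is supplied explicitly, tuneLength and search = "random" have no effect here and are omitted from the sketch.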