Tidymodels (fitting a random forest with fit_resamples()): Fold01: internal: Error: Must group by variables found in `.data`
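Overview
I have built a random forest regression model. My aim is to fit it with the function fit_resamples() and then tune the hyperparameters. However, I get the following error message:
Error message: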
! Fold01: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold01: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
! Fold02: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold02: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
! Fold03: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold03: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
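I have searched online for a solution but could not find a question that matches my specific problem. I am not an advanced R user, and I am doing my best to slowly work my way through the Tidymodels package.
If anyone can help with this error message, I would be very grateful.
Many thanks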
R-code
set.seed(45L)
#Open libraries
library(tidymodels)
library(ranger)
library(dplyr)
library(future) # provides plan(), used below
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
#Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
#resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)
###########################################################
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
#Produce the random forest model
mod_rf <- rand_forest(
mtry = tune(),
trees = 1000,
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
##Workflow
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
##Fit model
plan(multisession)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
#Error Message
! Fold01: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold01: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
! Fold02: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold02: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
! Fold03: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold03: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
Data frame FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")
If you look at the help page for fit_resamples():
fit_resamples() computes a set of performance metrics across one or
more resamples. It does not perform any tuning (see tune_grid() and
tune_bayes() for that)
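The "Column `mtry` is not found" message is consistent with that: the model specification still contains tune() placeholders when it reaches fit_resamples(). As a minimal sketch, assuming tune_args() is available once tidymodels is attached, the arguments still awaiting tuning can be listed like this (mod_check is a hypothetical name that mirrors the mod_rf specification from the question):

```r
library(tidymodels)

# Same specification as mod_rf in the question (mod_check is a hypothetical name)
mod_check <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

# List the arguments that are still marked with tune()
pending <- tune_args(mod_check)
pending$name
```

If this returns any names, the specification probably has to go through tune_grid() or tune_bayes() before it is passed to fit_resamples().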
You most likely need to tune first, and then run fit_resamples() with the parameters obtained from tuning, for example:
rf_grid <- expand.grid(mtry = 2:4, min_n = c(10,15,20))
mod_rf <- rand_forest(
mtry = tune(),
trees = 1000,
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
rf_res <-
wflow_rf %>%
tune_grid(
resamples = cv,grid = rf_grid
)
Find the best parameters:
show_best(rf_res,metric="rmse")
# A tibble: 5 x 7
mtry min_n .metric .estimator mean n std_err
<int> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 4 10 rmse standard 7.87 10 0.743
2 4 15 rmse standard 8.27 10 0.649
3 3 10 rmse standard 8.49 10 0.682
4 3 15 rmse standard 8.97 10 0.620
5 4 20 rmse standard 9.49 10 0.605
Then run it again with those values:
mod_rf <- rand_forest(mtry = 4,trees = 1000,min_n = 10) %>%
set_mode("regression") %>%
set_engine("ranger")
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
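Instead of copying the best values by hand, the tuned parameters can probably be passed along programmatically with select_best() and finalize_workflow(). The following is only a sketch under stated assumptions: it uses the built-in mtcars data as a stand-in for FID, a smaller forest and 5 folds to keep it quick, and the *_demo names are hypothetical.

```r
library(tidymodels)

set.seed(45L)

# Stand-in data: mtcars replaces FID purely for illustration
cv_demo <- vfold_cv(mtcars, v = 5)
rec_demo <- recipe(mpg ~ ., data = mtcars)

# Model with tune() placeholders, as in the question
mod_demo <- rand_forest(mtry = tune(), trees = 200, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

wflow_demo <- workflow() %>%
  add_model(mod_demo) %>%
  add_recipe(rec_demo)

# Step 1: tune over a small grid
grid_demo <- expand.grid(mtry = 2:4, min_n = c(5, 10))
res_demo <- tune_grid(
  wflow_demo,
  resamples = cv_demo,
  grid = grid_demo,
  metrics = metric_set(rmse, rsq)
)

# Step 2: pick the combination with the lowest mean RMSE
best_demo <- select_best(res_demo, metric = "rmse")

# Step 3: replace the tune() placeholders with the selected values
final_wflow <- finalize_workflow(wflow_demo, best_demo)

# Step 4: fit_resamples() now receives a fully specified workflow
fit_demo <- fit_resamples(
  final_wflow,
  cv_demo,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)
collect_metrics(fit_demo)
```

With the data from the question, rec, cv and rf_grid would take the place of the stand-in objects. Note that evaluating on the same folds that were used for tuning tends to give optimistic estimates, so the held-out test_data remains useful for a final check.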