Tidymodels: Tunable models involving 10-fold Cross Validation Using the Function tune_grid() in R
Overview
I have produced four models using the tidymodels package with the data frame FID (see below):
- General linear model
- Bagged trees
- Random forest
- Boosted trees
The data frame contains three predictors:
- Year (numeric)
- Month (factor)
- Days (numeric)
The dependent variable is Frequency (numeric).
The original regularization penalty was 0.1, which I chose somewhat arbitrarily. My goal is to estimate the correct, or best, value for the regularization penalty. The idea is to estimate this model hyperparameter (the best-value model), which cannot be assessed during model training. I am trying to estimate the best penalty value by training many models on resampled data sets and exploring how well those models perform. Therefore, I am building a new model specification for model tuning.
I am following this tutorial:
https://smltar.com/mlregression.html#firstregressionevaluation
I am experiencing this error message:
Error: A `model` action has already been added to this workflow.
Running rlang::last_error() gives:
<error/rlang_error>
A `model` action has already been added to this workflow.
Backtrace:
1. tune::tune_grid(...)
10. workflows::add_model(., tune_spec_glm)
11. workflows:::add_action(x, action, "model")
13. workflows:::add_action_impl.action_fit(x, action, name)
14. workflows:::check_singleton(x$fit$actions, name)
15. workflows:::glubort("A `{name}` action has already been added to this workflow.")
Run `rlang::last_trace()` to see the full context.
I would be deeply appreciative if anyone could help me solve this issue.
Many thanks in advance.
R code:
## Load the required packages
library(tidymodels)
library(glmnet)     # regularized regression engine
library(parsnip)    # model specifications
library(rpart.plot) # plotting decision trees
library(rpart)      # decision trees
library(tidyverse)  # data manipulation
library(skimr)      # data summaries
library(baguette)   # bagged trees
library(future)     # parallel processing to decrease computation time
library(xgboost)    # boosted trees
library(ranger)     # random forest engine
library(yardstick)  # model metrics
library(purrr)      # functional programming
library(forcats)    # factor handling
# Split the single data set into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets
train_data <- training(data_split)
test_data <- testing(data_split)
# Resample the training data with 10-fold cross-validation (v = 10 is the default)
cv <- vfold_cv(train_data, v = 10)
###########################################################
## Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove zero-variance predictors
  step_novel(all_nominal()) %>% # prepare test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replace missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy-code categorical variables
##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################
##Produce the glm model
mod_glm <- linear_reg(mode = "regression",
                      penalty = 0.1,
                      mixture = 1) %>%
  set_engine("glmnet")
## Create workflow
wflow_glm <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_glm)
##Fit the glm model
###########################################################################
## MODEL EVALUATION
## To estimate how well the model performs, fit it many times,
## once to each of the resampled folds, and then evaluate on the held-out
## part of each resampled fold.
###########################################################################
plan(multisession)
fit_glm <- fit_resamples(
  wflow_glm,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)
##Collect model predictions for each K-fold for the number of blue whale sightings
Predictions <- fit_glm %>%
  collect_predictions()
####### Tuning hyperparameters
## Estimate the best regularization penalty (to configure the best-value model)
## by training many models on resampled data sets
## and exploring how well these models perform
tune_spec_glm <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
tune_spec_glm
## Create a regular grid of penalty values to try, using the penalty()
## convenience function
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
####
tune_rs <- tune_grid(
  wflow_glm %>% add_model(tune_spec_glm),
  cv,
  grid = lambda_grid,
  control = control_resamples(save_pred = TRUE)
)
##Error message
Error: A `model` action has already been added to this workflow.
Run `rlang::last_error()` to see where the error occurred.
Data frame - FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")
You should use update_model() instead of add_model(). Your workflow wflow_glm already contains mod_glm, so calling add_model() a second time fails with this error; update_model() replaces the existing model specification with the tunable one.
tune_rs <- tune_grid(
  wflow_glm %>% update_model(tune_spec_glm),
  cv,
  grid = lambda_grid,
  control = control_resamples(save_pred = TRUE)
)
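Once tune_grid() runs, you can pick the best penalty and finalize the workflow. A minimal sketch of that follow-up (assuming tune_rs completed successfully and using rmse as the selection metric):
collect_metrics(tune_rs)                              # performance across the penalty grid
best_penalty <- select_best(tune_rs, metric = "rmse") # penalty with the lowest RMSE
final_wflow <- finalize_workflow(
  wflow_glm %>% update_model(tune_spec_glm),
  best_penalty
)
final_fit <- fit(final_wflow, data = train_data)      # refit on the full training set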
I can also offer some general comments on your example:
- I changed the lines
train_data <- training(FID)
test_data <- testing(FID)
to
train_data <- training(data_split)
test_data <- testing(data_split)
I assume this was a typo made while putting the example together for this question, since it throws an error as written.
- The recipe should be trained on the training split; otherwise you get data leakage.
In your code this does not actually matter, because the prep() training is performed inside the workflow, which uses the training data:
rec <- recipe(Frequency ~ ., data = train_data) %>%
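To illustrate the leakage point (a minimal sketch, relevant only if you prep() the recipe manually outside a workflow): train the recipe on train_data alone, then apply it to both splits.
rec_trained <- prep(rec, training = train_data)         # estimate recipe steps from training data only
train_baked <- bake(rec_trained, new_data = NULL)       # processed training set
test_baked  <- bake(rec_trained, new_data = test_data)  # processed test set, no leakage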
- You could use Poisson regression for this problem, since the outcome is a count. In tidymodels you can use
poissonreg::poisson_reg()
https://poissonreg.tidymodels.org/
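A minimal sketch of what that could look like (assuming the poissonreg package is installed; the "glm" engine is one of several it supports):
library(poissonreg)
mod_pois <- poisson_reg() %>%
  set_engine("glm")   # Poisson regression via stats::glm
wflow_pois <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_pois)
fit_pois <- fit_resamples(
  wflow_pois,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)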