How to prep a recipe that includes tunable arguments?
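As you can see from my code, I am trying to include feature selection in my tidymodels workflow. I am using some Kaggle data to predict customer churn.
To apply the preprocessing to the training and test data, I bake the recipe after calling prep().
However, if I want to tune the top_p argument of step_select_roc(), I do not know how to prep() the recipe afterwards. Doing so in my reprex below results in an error.
Maybe I have to adjust my workflow and separate some of the recipe tasks to get the job done. What is the best way to achieve this?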
#### LIBS
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(recipeselectors))
#### INPUT
# get dataset from: https://www.kaggle.com/shrutimechlearn/churn-modelling
data <- fread("Churn_Modelling.csv")
# split data
set.seed(seed = 1972)
train_test_split <-
rsample::initial_split(
data = data,
prop = 0.80
)
train_tbl <- train_test_split %>% training()
test_tbl <- train_test_split %>% testing()
#### FEATURE ENGINEERING
# Define the recipe
recipe <- recipe(Exited ~ ., data = train_tbl) %>%
step_rm(one_of("RowNumber", "Surname")) %>%
update_role(CustomerId, new_role = "Helper") %>%
step_num2factor(all_outcomes(),
levels = c("No", "Yes"),
transform = function(x) {x + 1}) %>%
step_normalize(all_numeric(), -has_role(match = "Helper")) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_corr(all_numeric(), -has_role("Helper")) %>%
step_nzv(all_predictors()) %>%
step_select_roc(all_predictors(), outcome = "Exited", top_p = tune()) %>%
prep()
# Bake it
train_baked <- recipe %>% bake(train_tbl)
test_baked <- recipe %>% bake(test_tbl)
You cannot prep() a recipe that has tunable arguments. Think of prep() as analogous to fit() for a model; you cannot fit a model if its hyperparameters have not been set.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = tune::tune())
prep(rec, training = USArrests)
#> Error in `prep()`:
#> ! You cannot `prep()` a tuneable recipe. Argument(s) with `tune()`: 'num_comp'. Do you want to use a tuning function such as `tune_grid()`?
Created on 2022-02-22 by the reprex package (v2.0.1)
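The other side of the fit() analogy is that prep() works as soon as every tune() placeholder has been replaced by a concrete value. The sketch below reuses the USArrests recipe from the reprex and fills in num_comp with finalize_recipe() from the tune package. The value 2 is an arbitrary assumption for illustration; in practice it would come from a tuning function such as tune_grid().

```r
library(recipes)
library(tune)  # provides tune() and finalize_recipe()

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = tune())

# Replace the tune() placeholder with a concrete value.
# num_comp = 2 is an arbitrary choice made only for this illustration.
final_rec <- finalize_recipe(rec, tibble::tibble(num_comp = 2))

# With no placeholders left, prep() can estimate the steps as usual.
prepped <- prep(final_rec, training = USArrests)

# new_data = NULL returns the processed training data.
baked <- bake(prepped, new_data = NULL)
dim(baked)
```

The baked data should contain one column per retained principal component.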
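Thanks to the help of Steven Pawley, I was able to integrate the tunable top_p argument of step_select_roc() into my tidymodels modeling workflow. As Julia Silge mentioned, it is not possible to prep a recipe that has tunable arguments. So if you still want to prep and bake your recipe, you can only do so after finalizing the model and the recipe, as in the following example: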
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(recipeselectors))
suppressPackageStartupMessages(library(finetune))
data(cells, package = "modeldata")
cells <- cells %>% select(-case)
set.seed(31)
split <- initial_split(cells, prop = 0.8)
train <- training(split)
test <- testing(split)
rec <-
recipe(class ~ ., data = train) %>%
step_corr(all_predictors(), threshold = 0.9) %>%
step_select_roc(all_predictors(), outcome = "class", top_p = tune())
# xgboost model
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(),
sample_size = tune(), mtry = tune(),
learn_rate = tune(),
stop_iter = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# grid
xgb_grid <- grid_latin_hypercube(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train),
learn_rate(),
stop_iter(range = c(5L,50L)),
size = 5
)
rec_grid <- grid_latin_hypercube(
parameters(rec) %>%
update(top_p = top_p(c(0,30))) ,
size = 5
)
comp_grid <- merge(xgb_grid, rec_grid)
model_metrics <- metric_set(roc_auc)
# resample the training set only, so the held-out test set stays unseen during tuning
rs <- vfold_cv(train)
ctrl <- control_grid(pkgs = "recipeselectors")
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(234)
rfe_res <-
xgb_spec %>%
tune_grid(
preprocessor = rec,
resamples = rs,
grid = comp_grid,
# use the metric set defined above instead of the default metrics
metrics = model_metrics,
control = ctrl
)
stopCluster(cl)
# name the metric argument explicitly; recent versions of tune reject it positionally
best <- rfe_res %>% select_best(metric = "roc_auc")
# finalize
final_mod <- finalize_model(xgb_spec, best)
final_rec <- finalize_recipe(rec, best)
# bakery: prep the finalized recipe once on its training data, then bake both sets
final_prep <- final_rec %>% prep()
bake_test <- final_prep %>% bake(new_data = testing(split))
bake_train <- final_prep %>% bake(new_data = training(split))
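An alternative to finalizing the recipe and the model separately is to bundle them in a workflow and finalize that in one step. The sketch below is only an illustration under stated assumptions: it swaps step_select_roc() for step_pca() (so it runs without recipeselectors), replaces xgboost with a plain glm logistic regression, and uses a tiny hand-written grid. The names pca_rec, wf, res, and final_fit are hypothetical.

```r
suppressPackageStartupMessages(library(tidymodels))

data(cells, package = "modeldata")
cells <- cells %>% select(-case)

set.seed(31)
split <- initial_split(cells, prop = 0.8)
train <- training(split)
test <- testing(split)

# Stand-in recipe: step_pca() takes the place of step_select_roc() here.
pca_rec <- recipe(class ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())

# Bundle the recipe and a simple model into one workflow.
wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Tune the recipe parameter on resamples of the training set only.
set.seed(234)
res <- tune_grid(
  wf,
  resamples = vfold_cv(train, v = 3),
  grid = tibble(num_comp = c(2L, 5L)),
  metrics = metric_set(roc_auc)
)

# Fill in the best value and fit; fitting preps the recipe internally.
best <- select_best(res, metric = "roc_auc")
final_fit <- finalize_workflow(wf, best) %>% fit(data = train)

# Pull out the prepped recipe to bake new data if the processed
# predictors themselves are needed.
baked_test <- extract_recipe(final_fit) %>% bake(new_data = test)
```

With this layout, predict(final_fit, new_data = test) applies the finalized preprocessing automatically, so an explicit prep() and bake() is only needed when you want to inspect the processed data.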