How to select more than one model with workflow_set (tidymodels) based on different metrics

Question

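The workflow set below runs correctly, and I need to select the best two models (for one or more metrics). The models differ only in the step the recipe uses to handle class imbalance (none, SMOTE, ROSE, upsampling, ADASYN). I am interested in selecting more than one model, specifically the best two, and in seeing which imbalance step each of them uses.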

```r
library(tidymodels)  # recipes, parsnip, rsample, tune, yardstick, workflowsets
library(themis)      # step_smote(), step_rose(), step_upsample(), step_adasyn()

# Metrics computed during tuning. The start of this call is missing from the
# post, so only the four metrics that survive are listed here.
metrics_lasso <- metric_set(yardstick::sensitivity, yardstick::specificity,
                            yardstick::precision, yardstick::recall)
folds <- vfold_cv(data_train, v = 3, strata = class)

# Baseline recipe: no treatment of class imbalance
rec_obj_all <- data_train %>%
  recipe(class ~ .) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors())

# Same preprocessing, followed by SMOTE
rec_obj_all_s <- data_train %>%
  recipe(class ~ .) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_smote(class)

# Same preprocessing, followed by ROSE
rec_obj_all_r <- data_train %>%
  recipe(class ~ .) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_rose(class)

# Same preprocessing, followed by upsampling
rec_obj_all_up <- data_train %>%
  recipe(class ~ .) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_upsample(class)

# Same preprocessing, followed by ADASYN
rec_obj_all_ad <- data_train %>%
  recipe(class ~ .) %>%
  step_naomit(everything(), skip = TRUE) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_adasyn(class)

# Lasso logistic regression with a tunable penalty
lasso_mod1 <- logistic_reg(penalty = tune(),
                           mixture = 1) %>%
  set_engine("glmnet")

tictoc::tic()

# Parallel backend: leave four physical cores free, but keep at least one worker
all_cores <- parallel::detectCores(logical = FALSE)
library(doFuture)
registerDoFuture()
cl <- parallel::makeCluster(max(1, all_cores - 4))
plan(cluster, workers = cl)

# One workflow per recipe, all sharing the same model
balances <-
  workflow_set(
    preproc = list(unba = rec_obj_all, b_sm = rec_obj_all_s, b_ro = rec_obj_all_r,
                   b_up = rec_obj_all_up, b_ad = rec_obj_all_ad),
    models = list(lasso_mod1),
    cross = TRUE
  )

grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = FALSE
  )

# Tune every workflow on the same folds with a 25-point grid
grid_results <-
  balances %>%
  workflow_map(
    seed = 1503,
    resamples = folds,
    grid = 25,
    metrics = metrics_lasso,
    control = grid_ctrl,
    verbose = TRUE
  )

parallel::stopCluster(cl)

tictoc::toc()
```

I don't understand which function in the workflowsets package selects the best two or more models.

Answer 1

There are convenience functions in workflowsets to rank results and extract the best results, but if you have a more specific use case like the one you describe here (best two, or best based on more complex filtering), then go ahead and use tidyr + dplyr verbs to handle your results in grid_results. You can unnest() and/or use the results of rank_results() to pull out whatever you are interested in.
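As a rough illustration of the dplyr route, here is a minimal sketch. The helper `top_workflows()` is hypothetical (it is not part of workflowsets), and it assumes that a higher mean is better for the chosen metric, which holds for sensitivity, specificity, precision and recall. The `demo` data frame holds made-up numbers that only mimic the columns returned by `rank_results()`; the last part runs only if a `grid_results` object like the one in the question exists.

```r
library(dplyr)

# Hypothetical helper: keep the n best workflows for one metric.
# `ranked` needs the columns wflow_id, .metric and mean, which
# workflowsets::rank_results() returns. Assumes higher mean = better.
top_workflows <- function(ranked, metric, n = 2) {
  ranked %>%
    filter(.metric == .env$metric) %>%
    group_by(wflow_id) %>%
    slice_max(mean, n = 1, with_ties = FALSE) %>%  # best config per workflow
    ungroup() %>%
    slice_max(mean, n = n, with_ties = FALSE)      # best n workflows overall
}

# Made-up values, for illustration only (not real tuning results)
demo <- data.frame(
  wflow_id = c("unba_logistic_reg", "unba_logistic_reg",
               "b_sm_logistic_reg", "b_sm_logistic_reg",
               "b_up_logistic_reg", "b_up_logistic_reg",
               "unba_logistic_reg", "b_sm_logistic_reg"),
  .config  = c("c1", "c2", "c1", "c2", "c1", "c2", "c1", "c1"),
  .metric  = c(rep("sensitivity", 6), rep("specificity", 2)),
  mean     = c(0.60, 0.62, 0.80, 0.78, 0.75, 0.70, 0.95, 0.85),
  stringsAsFactors = FALSE
)

top_workflows(demo, metric = "sensitivity", n = 2)

# With the real object from the question (skipped if it does not exist)
if (exists("grid_results")) {
  ranked <- workflowsets::rank_results(grid_results, rank_metric = "sensitivity")

  # Best two workflows for each metric of interest
  best_sens <- top_workflows(ranked, metric = "sensitivity", n = 2)
  best_spec <- top_workflows(ranked, metric = "specificity", n = 2)

  # Full tuning results for the two best workflows by sensitivity
  best_results <- lapply(
    best_sens$wflow_id,
    function(id) workflowsets::extract_workflow_set_result(grid_results, id = id)
  )
}
```

Because each `wflow_id` starts with the recipe name (`unba`, `b_sm`, `b_ro`, `b_up`, `b_ad`), the rows that come back also show which imbalance step the winning workflows use. A shorter route for a single metric is `rank_results(grid_results, rank_metric = "sensitivity", select_best = TRUE)` followed by `filter(rank <= 2)`.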

workflow

r

classification

glmnet

tidymodels