Tidymodels tune_grid: "Can't subset columns that don't exist" when not using formula
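I've put together a data preprocessing recipe for the recent coffee dataset featured on TidyTuesday. My intention is to generate a workflow and then tune a hyperparameter from there. I'm specifically interested in setting variable roles through the various update_role() calls rather than through a formula.
The example below produces a recipe that works with prep and bake(coffee_test). It even works if I deselect the outcome column, e.g. coffee_recipe %>% bake(select(coffee_test, -cupper_points)). However, when I run the workflow through tune_grid, I get the errors shown. It looks like tune_grid can't find the variables that don't have the "predictor" role, even though bake handles them fine.
Now, if I instead do things the usual way, with a formula and step_rm for the variables I don't care about, then things mostly work: I get a few warnings about rows with missing country_of_origin values.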
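For comparison, the formula-plus-step_rm approach mentioned above might look roughly like the sketch below. It runs on a made-up toy data frame rather than the real coffee data (the column names are borrowed from the dataset, the values are invented), and it assumes only that the recipes package is installed.

```r
library(recipes)

# Toy stand-in for the coffee data; the values are invented for illustration.
toy <- data.frame(
  cupper_points = c(7.5, 8.0, 7.8, 8.2),
  aroma         = c(7.9, 8.1, 7.7, 8.3),
  variety       = c("Bourbon", "Typica", "Bourbon", "Other"),
  owner         = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# The formula assigns roles up front: cupper_points is the outcome and
# every other column starts out as a predictor.
formula_recipe <- recipe(cupper_points ~ ., data = toy) %>%
  step_rm(owner) %>%                                   # drop an unwanted column
  step_string2factor(all_nominal(), -all_outcomes())   # strings -> factors

# Estimate the steps on the toy data, then apply them to the same data.
baked <- formula_recipe %>%
  prep() %>%
  bake(new_data = toy)

names(baked)
```

Because the formula gives every column a role from the start, no variable is left undeclared, which is the main difference from the update_role() version.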
library(tidyverse)
library(tidymodels)
coffee <- tidytuesdayR::tt_load(2020, week = 28)$coffee_ratings
colnames(coffee)
#> [1] "total_cup_points" "species" "owner"
#> [4] "country_of_origin" "farm_name" "lot_number"
#> [7] "mill" "ico_number" "company"
#> [10] "altitude" "region" "producer"
#> [13] "number_of_bags" "bag_weight" "in_country_partner"
#> [16] "harvest_year" "grading_date" "owner_1"
#> [19] "variety" "processing_method" "aroma"
#> [22] "flavor" "aftertaste" "acidity"
#> [25] "body" "balance" "uniformity"
#> [28] "clean_cup" "sweetness" "cupper_points"
#> [31] "moisture" "category_one_defects" "quakers"
#> [34] "color" "category_two_defects" "expiration"
#> [37] "certification_body" "certification_address" "certification_contact"
#> [40] "unit_of_measurement" "altitude_low_meters" "altitude_high_meters"
#> [43] "altitude_mean_meters"
coffee_split <- initial_split(coffee, prop = 0.8)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)
coffee_recipe <- recipe(coffee_train) %>%
  update_role(cupper_points, new_role = "outcome") %>%
  update_role(
    variety, processing_method, country_of_origin,
    aroma, flavor, aftertaste, acidity, sweetness, altitude_mean_meters,
    new_role = "predictor"
  ) %>%
  step_string2factor(all_nominal(), -all_outcomes()) %>%
  step_knnimpute(
    country_of_origin, altitude_mean_meters,
    impute_with = imp_vars(
      in_country_partner, company, region, farm_name, certification_body
    )
  ) %>%
  step_unknown(variety, processing_method, new_level = "Unknown") %>%
  step_other(country_of_origin, threshold = 0.01) %>%
  step_other(processing_method, threshold = 0.10) %>%
  step_other(variety, threshold = 0.10)
coffee_recipe
#> Data Recipe
#> Inputs:
#> role #variables
#> outcome 1
#> predictor 9
#> 33 variables with undeclared roles
#> Operations:
#> Factor variables from all_nominal(), -all_outcomes()
#> K-nearest neighbor imputation for country_of_origin, altitude_mean_meters
#> Unknown factor level assignment for variety, processing_method
#> Collapsing factor levels for country_of_origin
#> Collapsing factor levels for processing_method
#> Collapsing factor levels for variety
# This works just fine
coffee_recipe %>%
  prep(coffee_train) %>%
  bake(select(coffee_test, -cupper_points)) %>%
  head()
#> # A tibble: 6 x 42
#> total_cup_points species owner country_of_orig… farm_name lot_number mill
#> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 90.6 Arabica meta… Ethiopia metad plc <NA> meta…
#> 2 87.9 Arabica cqi … other <NA> <NA> <NA>
#> 3 87.9 Arabica grou… United States (… <NA> <NA> <NA>
#> 4 87.3 Arabica ethi… Ethiopia <NA> <NA> <NA>
#> 5 87.2 Arabica cqi … other <NA> <NA> <NA>
#> 6 86.9 Arabica ethi… Ethiopia <NA> <NA> <NA>
#> # … with 35 more variables: ico_number <fct>, company <fct>, altitude <fct>,
#> # region <fct>, producer <fct>, number_of_bags <dbl>, bag_weight <fct>,
#> # in_country_partner <fct>, harvest_year <fct>, grading_date <fct>,
#> # owner_1 <fct>, variety <fct>, processing_method <fct>, aroma <dbl>,
#> # flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
#> # uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, moisture <dbl>,
#> # category_one_defects <dbl>, quakers <dbl>, color <fct>,
#> # category_two_defects <dbl>, expiration <fct>, certification_body <fct>,
#> # certification_address <fct>, certification_contact <fct>,
#> # unit_of_measurement <fct>, altitude_low_meters <dbl>,
#> # altitude_high_meters <dbl>, altitude_mean_meters <dbl>
# Now let's try putting it into a workflow and running tune_grid
coffee_model <- rand_forest(trees = 500, mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
coffee_model
#> Random Forest Model Specification (regression)
#> Main Arguments:
#> mtry = tune()
#> trees = 500
#> Computational engine: ranger
coffee_workflow <- workflow() %>%
  add_recipe(coffee_recipe) %>%
  add_model(coffee_model)
coffee_workflow
#> ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#> ── Preprocessor ───────────────────────────────────────────────────────────────────────────────
#> 6 Recipe Steps
#> ● step_string2factor()
#> ● step_knnimpute()
#> ● step_unknown()
#> ● step_other()
#> ● step_other()
#> ● step_other()
#> ── Model ──────────────────────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#> Main Arguments:
#> mtry = tune()
#> trees = 500
#> Computational engine: ranger
coffee_grid <- expand_grid(mtry = c(2, 5))
coffee_folds <- vfold_cv(coffee_train, v = 5)
coffee_workflow %>%
  tune_grid(
    resamples = coffee_folds,
    grid = coffee_grid
  )
#> x Fold1: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold1: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold2: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold2: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold3: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold3: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold4: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold4: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold5: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold5: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> Warning: All models failed in tune_grid(). See the `.notes` column.
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 5 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [857/215]> Fold1 <NULL> <tibble [2 × 1]>
#> 2 <split [857/215]> Fold2 <NULL> <tibble [2 × 1]>
#> 3 <split [858/214]> Fold3 <NULL> <tibble [2 × 1]>
#> 4 <split [858/214]> Fold4 <NULL> <tibble [2 × 1]>
#> 5 <split [858/214]> Fold5 <NULL> <tibble [2 × 1]>
Created on 2020-07-21 by the reprex package (v0.3.0)
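The error here happens because, during tuning, the recipe reaches step_string2factor() and starts trying to handle variables that don't have any role, such as species and owner.
Try setting a role for all of your nominal variables before picking out the outcome and predictors.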
coffee_recipe <- recipe(coffee_train) %>%
  update_role(all_nominal(), new_role = "id") %>% ## ADD THIS
  update_role(cupper_points, new_role = "outcome") %>%
  update_role(
    variety, processing_method, country_of_origin,
    aroma, flavor, aftertaste, acidity, sweetness, altitude_mean_meters,
    new_role = "predictor"
  ) %>%
  step_string2factor(all_nominal(), -all_outcomes()) %>%
  step_knnimpute(
    country_of_origin, altitude_mean_meters,
    impute_with = imp_vars(
      in_country_partner, company, region, farm_name, certification_body
    )
  ) %>%
  step_unknown(variety, processing_method, new_level = "Unknown") %>%
  step_other(country_of_origin, threshold = 0.01) %>%
  step_other(processing_method, threshold = 0.10) %>%
  step_other(variety, threshold = 0.10)
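One way to check that no nominal column is left without a role is to look at the recipe's summary(). The sketch below uses a made-up toy tibble instead of the coffee data (the column names are borrowed from the dataset, the values are invented) and assumes recipes and dplyr are installed. It only illustrates the role bookkeeping, not the tuning itself.

```r
library(recipes)
library(dplyr)

# Toy stand-in for the coffee data; the values are invented for illustration.
toy <- tibble(
  cupper_points = c(7.5, 8.0, 7.8, 8.2),
  aroma         = c(7.9, 8.1, 7.7, 8.3),
  variety       = c("Bourbon", "Typica", "Bourbon", "Other"),
  owner         = c("a", "b", "c", "d")
)

# Same pattern as the fix: give every nominal column an "id" role first,
# then assign the outcome and predictor roles explicitly.
toy_recipe <- recipe(toy) %>%
  update_role(all_nominal(), new_role = "id") %>%
  update_role(cupper_points, new_role = "outcome") %>%
  update_role(aroma, variety, new_role = "predictor")

# summary() returns one row per variable, including its current role.
roles <- summary(toy_recipe)
roles

# Variables that still have no declared role would be listed here.
undeclared <- roles %>% filter(is.na(role))
undeclared$variable
```

In this toy case owner keeps the "id" role while variety is moved from "id" to "predictor", so no nominal column is left undeclared when step_string2factor() runs.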
