Write a workflow for classification using tidymodels. Get "Error: Column `.row` must be length.."
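I want to build a regularized logistic regression model to predict Class in the breastcancer dataset found in the OneR package. I would like to use the tidymodels framework to put all of this into one tidy workflow.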
library(tidymodels)
library(OneR)
#specify model
bc.lr = logistic_reg(
mode="classification",
penalty = tune(),
mixture=1
) %>%
set_engine("glmnet")
#tune penalty term using 4-fold cv
cv_splits<-vfold_cv(breastcancer,v=4,strata="Class")
#simple recipe to scale all predictors and remove observations with NAs
bc.recipe <- recipe (Class ~., data = breastcancer) %>%
step_normalize(all_predictors()) %>%
step_naomit(all_predictors(), all_outcomes()) %>%
prep()
#set up a grid of tuning parameters
tuning_grid = grid_regular(penalty(range = c(0, 0.5)),
levels = 10,
original = F)
#put everything together into a workflow
bc.wkfl <- workflow() %>%
add_recipe(bc.recipe) %>%
add_model(bc.lr)
#model fit
tune = tune_grid(bc.wkfl,
resample = cv_splits,
grid = tuning_grid,
metrics = metric_set(accuracy),
control = control_grid(save_pred = T))
When I try to call tune_grid, I get a strange error.
Fold1: model 1/1 (predictions): Error: Column `.row` must be length ....
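The problem here is how the recipe step handles NA values. This is a step where you need to think carefully about "skipping". Without skipping, step_naomit() also drops rows from the held-out data when predictions are made, so the predictions no longer line up with the rows of the assessment set. From that article: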
When doing resampling or a training/test split, certain operations make sense for the data to be used for modeling but are problematic for new samples or the test set.
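To make the idea of skipping concrete, here is a minimal sketch on toy data. The data frames `train` and `new` are made up for illustration and are not part of the original question; `juice()` is used to return the processed training set because it is available in older recipes releases such as the one in the reprex. With skip = TRUE, the NA row is removed when the recipe is prepped on the training data, but no rows are removed when the recipe is applied to new data.

```r
library(recipes)

# Hypothetical toy data: one missing predictor value in each data frame
train <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 40))
new   <- data.frame(y = c(5, 6),       x = c(NA, 20))

rec <- recipe(y ~ x, data = train) %>%
  step_naomit(all_predictors(), skip = TRUE)

prepped <- prep(rec, training = train)

# Training data: the step runs during prep(), so the NA row is dropped
n_train <- nrow(juice(prepped))

# New data: the step is skipped in bake(), so every row is kept
n_new <- nrow(bake(prepped, new_data = new))

c(training_rows = n_train, new_rows = n_new)
```

Because the row count of the new data is preserved, the predictions can be matched back to the rows of the assessment set, which is what tune_grid() needs when save_pred = TRUE.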
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.6
#> ✓ dplyr 0.8.5 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.0 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.1 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ───────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
library(OneR)
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
## cross validation split
cancer_splits <- vfold_cv(breastcancer, v = 4, strata = Class)
## preprocessing recipe (note skip = TRUE)
cancer_rec <- recipe(Class ~ ., data = breastcancer) %>%
step_naomit(all_predictors(), skip = TRUE) %>%
step_normalize(all_predictors())
## grid of tuning parameters
tuning_grid <- grid_regular(penalty(),
levels = 10)
## put everything together into a workflow
cancer_wf <- workflow() %>%
add_recipe(cancer_rec) %>%
add_model(lasso_spec)
## fit
cancer_res <- tune_grid(
cancer_wf,
resamples = cancer_splits,
grid = tuning_grid,
control = control_grid(save_pred = TRUE)
)
cancer_res
#> # 4-fold cross-validation using stratification
#> # A tibble: 4 x 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [523/176]> Fold1 <tibble [20 × 4]> <tibble [0 × 1]> <tibble [1,760 × 6…
#> 2 <split [524/175]> Fold2 <tibble [20 × 4]> <tibble [0 × 1]> <tibble [1,750 × 6…
#> 3 <split [525/174]> Fold3 <tibble [20 × 4]> <tibble [0 × 1]> <tibble [1,740 × 6…
#> 4 <split [525/174]> Fold4 <tibble [20 × 4]> <tibble [0 × 1]> <tibble [1,740 × 6…
Created on 2020-05-14 by the reprex package (v0.3.0)
Notice that setting skip = TRUE lets you handle NA values appropriately for new data.