Tidymodels (Fitting Bagged Trees with 10-Fold Cross Validation in R): x Fold01: model: Error: Input must be a vector, not NULL
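Overview:
I have produced four models using the tidymodels package with the data frame FID (see below):
- General linear model
- Bagged tree
- Random forest
- Boosted trees
The data frame contains three predictors:
- Year (numeric)
- Month (factor)
- Days (numeric)
The dependent variable is Frequency (numeric).
Issue
I am trying to fit a bagged tree model, but I am running into the error message below:
Any idea why I get an error when using bag_tree() together with fit_resamples()?
There is not much material online, apart from this post that I found; however, that issue relates to logistic regression rather than a bagged tree model.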
x Fold01: model: Error: Input must be a vector, not NULL.
x Fold02: model: Error: Input must be a vector, not NULL.
x Fold03: model: Error: Input must be a vector, not NULL.
x Fold04: model: Error: Input must be a vector, not NULL.
x Fold05: model: Error: Input must be a vector, not NULL.
x Fold06: model: Error: Input must be a vector, not NULL.
x Fold07: model: Error: Input must be a vector, not NULL.
x Fold08: model: Error: Input must be a vector, not NULL.
x Fold09: model: Error: Input must be a vector, not NULL.
x Fold10: model: Error: Input must be a vector, not NULL.
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.
If anyone can help with this error message, I would be very grateful for your advice.
Many thanks
R code
##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)
library(rlang)
library(poissonreg)
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)
###########################################################
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
#####Bagged Trees
mod_bag <- bag_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10) #10 bootstrap resamples
##Update the model with a cost/complexity parameter
##(documented as a positive number)
Updated_bag<-update(mod_bag, cost_complexity=1)
##Create workflow
wflow_bag <- workflow() %>%
  add_recipe(rec) %>%
  add_model(Updated_bag)
##Fit the bagged tree workflow on the training data
bag_fit_model <- fit(wflow_bag, data = train_data)
##We can access the fit using pull_workflow_fit(), and even
##tidy() the model coefficient results into a convenient dataframe format.
bag_fit_model %>%
  pull_workflow_fit()
##Predict the model
bag_predict<-predict(bag_fit_model, train_data)
##Fit the model to the cross-validation folds
plan(multisession)
fit_bag <- fit_resamples(
  wflow_bag,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))
x Fold01: model: Error: Input must be a vector, not NULL.
x Fold02: model: Error: Input must be a vector, not NULL.
x Fold03: model: Error: Input must be a vector, not NULL.
x Fold04: model: Error: Input must be a vector, not NULL.
x Fold05: model: Error: Input must be a vector, not NULL.
x Fold06: model: Error: Input must be a vector, not NULL.
x Fold07: model: Error: Input must be a vector, not NULL.
x Fold08: model: Error: Input must be a vector, not NULL.
x Fold09: model: Error: Input must be a vector, not NULL.
x Fold10: model: Error: Input must be a vector, not NULL.
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.
Data frame - FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")
The cost_complexity for a decision tree is sometimes called alpha, and it should be a positive number less than 1. When cost_complexity is less than 1, your model runs fine:
library(tidymodels)
library(baguette)
FID <- structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017),
Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L),
.Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"),
Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38),
Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v = 10)
rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
mod_bag <- bag_tree(cost_complexity = 0.1) %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10) #10 bootstrap resamples
wflow_bag <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_bag)
fit(wflow_bag, data = train_data)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bag_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> ● step_nzv()
#> ● step_novel()
#> ● step_medianimpute()
#> ● step_dummy()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Bagged CART (regression with 10 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 12 x 4
#> term value std.error used
#> <chr> <dbl> <dbl> <int>
#> 1 Days 4922. 369. 10
#> 2 Month_June 2253. 260. 9
#> 3 Month_July 1375. 139. 8
#> 4 Month_November 306. 96.4 3
#> 5 Year 272. 519. 2
#> 6 Month_May 270. 103. 4
#> 7 Month_February 191. 116. 4
#> 8 Month_August 105. 30.2 3
#> 9 Month_April 45.8 42.5 2
#> 10 Month_September 13.4 0 1
#> 11 Month_December 11.9 0 1
#> 12 Month_March 10.1 0 1
Created on 2020-12-17 by the reprex package (v0.3.0.9001)
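The reprex stops at fit(), while the question is about fit_resamples(). The following is a minimal sketch of the resampling step with cost_complexity set below 1, not a verified rerun of the original code. It rebuilds FID compactly from the posted data, uses an arbitrary seed, and simplifies the recipe (no step_nzv() or imputation, since FID has no missing values); all three choices are assumptions made for illustration. In more recent package releases step_medianimpute() and extract_model() appear to have been superseded by step_impute_median() and extract_fit_engine(), so the sketch leaves both out.

```r
library(tidymodels)
library(baguette)

# FID rebuilt compactly from the data frame posted in the question
FID <- data.frame(
  Year = rep(c(2015, 2016, 2017), each = 12),
  Month = factor(rep(month.name, times = 3), levels = month.name),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
)

set.seed(123) # arbitrary seed, only to make the split repeatable
data_split <- initial_split(FID)
train_data <- training(data_split)
cv <- vfold_cv(train_data, v = 10)

# Simplified recipe (assumption): only the factor handling is kept
rec <- recipe(Frequency ~ ., data = train_data) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal())

# cost_complexity below 1, as in the answer
mod_bag <- bag_tree(cost_complexity = 0.1) %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10)

wflow_bag <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_bag)

fit_bag <- fit_resamples(
  wflow_bag,
  resamples = cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)

# Cross-validated metrics, averaged over the folds
collect_metrics(fit_bag)

# Per-fold messages; this is the `.notes` column the warning points to
fit_bag$.notes[[1]]
```

With only 36 rows, each assessment fold holds two or three observations, so the rsq estimates in particular will be noisy and may come with warnings.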
I bet you tried a value of 1 because that is what is shown in the docs here, which is very misleading. We'll get that fixed.
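One possible explanation for why a value of 1 fails, offered as an assumption rather than something the answer states: rpart keeps a split only if it improves the overall fit by at least a factor of cp, so cp = 1 leaves a root-only tree, and such a tree carries no variable importance for the bagging code to collect. The sketch below calls rpart directly instead of bag_tree() and uses the built-in mtcars data as a stand-in for FID.

```r
library(rpart)

# mtcars is a stand-in dataset (assumption); the same idea applies to FID
tree_cp1   <- rpart(mpg ~ ., data = mtcars, cp = 1)   # value used in the question
tree_small <- rpart(mpg ~ ., data = mtcars, cp = 0.1) # value used in the answer

# Number of nodes in each tree: a single row means the tree never split
nrow(tree_cp1$frame)
nrow(tree_small$frame)

# A root-only tree has no variable importance component at all
is.null(tree_cp1$variable.importance)
is.null(tree_small$variable.importance)
```

If this reading is right, "Input must be a vector, not NULL" is the downstream symptom of trying to combine importance scores that do not exist for any of the bagged trees.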