Tidymodel 包：R 中的一般线性模型 (glm) 和决策树（袋装树、提升树和随机森林）模型

Question

问题

我正在尝试使用 R 中的 Tidymodels 包进行分析。我正在关注下面关于 R 中决策树学习的教程：-

教程

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

我有一个名为 FID 的数据框（见下文），其中 因变量 是 频率（数字)，预测变量为：- 年（数值）、月（因子）、季风（因子）和天（数值）。

我相信我已经成功学习了名为 “Tidymodels：R 中的决策树学习” 的教程，方法是构建 bagged 树、随机森林和 boosted 树 型号。

对于此分析，我还想构建一个一般线性模型 (glm) 以便在所有模型（即随机森林、袋装树、提升树和一般线性模型）来建立最佳模型拟合。所有模型都经过 10 倍交叉验证 以减少过度拟合的偏差。

问题

随后，我尝试调整教程中的代码（请参见下文）以适应 glm 模型，但我很困惑是否已适当调整模型。当我尝试在模型全部拟合后生成 rmse 值时，我不确定 glm R 代码的这个元素是否会生成以下错误消息：-

错误信息

Error: Problem with `mutate()` input `model`. x Input `model` can't be recycled to size 4. ℹ Input `model` is `c("bag", "rf", "boost")`. ℹ Input `model` must be size 4 or 1, not 3.

此外，我不确定在这些模型的 recipe() 函数 中表达的 R 代码是否充分或正确，这在 在拟合每个模型之前的处理步骤。从我的角度来看，我想知道是否可以改进模型的配方。

如果可能的话，我想知道是否有人可以帮助我解决拟合 glm 模型时的错误消息，并结合更正配方（如果有必要）。

我不是高级 R 编码员，我已经通过研究其他 Tidymodel 教程进行了彻底的尝试以尝试找到解决方案；但是，我不明白此错误消息的含义。因此，如果有人能够提供帮助，我谨表示最深切的谢意。

非常感谢。

R-代码

##Open the tidymodels package library(tidymodels) library(glmnet) library(parsnip) library(rpart.plot) library(rpart) library(tidyverse) # manipulating data library(skimr) # data visualization library(baguette) # bagged trees library(future) # parallel processing & decrease computation time library(xgboost) # boosted trees library(ranger) ########################################################### # Put 3/4 of the data into the training set #split this single dataset into two: a training set and a testing set data_split <- initial_split(Tidy_df, prop = 3/4) # Create data frames for the two sets: train_data <- training(data_split) test_data <- testing(data_split) # resample the data with 10-fold cross-validation (10-fold by default) cv <- vfold_cv(train_data) ########################################################### ##Produce the recipe ##Preprocessing ############################################################ rec <- recipe(Frequency ~ ., data = fid_df) %>% step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables ########################################################### ##Create Models ########################################################### ########################################################## ##General Linear Models ######################################################### ##glm mod_glm<-linear_reg(mode="regression", penalty = 0.1, mixture = 1) %>% set_engine("glmnet") ##Create workflow wflow_glm <- workflow() %>% add_recipe(rec) %>% add_model(mod_glm) ##Fit the model plan(multisession) fit_glm <- fit_resamples( wflow_glm, cv, metrics = metric_set(rmse, rsq), control = control_resamples(save_pred = TRUE) ) ########################################################## ##Bagged Trees ########################################################## #####Bagged Trees mod_bag <- bag_tree() %>% set_mode("regression") %>% set_engine("rpart", times = 10) #10 bootstrap resamples ##Create workflow wflow_bag <- workflow() %>% add_recipe(rec) %>% add_model(mod_bag) ##Fit the model plan(multisession) fit_bag <- fit_resamples( wflow_bag, cv, metrics = metric_set(rmse, rsq), control = control_resamples(save_pred = TRUE) ) ################################################### ##Random forests ################################################### mod_rf <-rand_forest(trees = 1e3) %>% set_engine("ranger", num.threads = parallel::detectCores(), importance = "permutation", verbose = TRUE) %>% set_mode("regression") ##Create Workflow wflow_rf <- workflow() %>% add_model(mod_rf) %>% add_recipe(rec) ##Fit the model plan(multisession) fit_rf<-fit_resamples( wflow_rf, cv, metrics = metric_set(rmse, rsq), control = control_resamples(save_pred = TRUE) ) ############################################################ ##Boosted Trees ############################################################ mod_boost <- boost_tree() %>% set_engine("xgboost", nthreads = parallel::detectCores()) %>% set_mode("regression") ##Create workflow wflow_boost <- workflow() %>% add_recipe(rec) %>% add_model(mod_boost) ##Fit model plan(multisession) fit_boost <-fit_resamples( wflow_boost, cv, metrics = metric_set(rmse, rsq), control = control_resamples(save_pred = TRUE) ) ############################################## ##Evaluate the models ############################################## collect_metrics(fit_bag) %>% bind_rows(collect_metrics(fit_rf)) %>% bind_rows(collect_metrics(fit_boost)) %>% bind_rows(collect_metrics(fit_glm)) %>% dplyr::filter(.metric == "rmse") %>% dplyr::mutate(model = c("bag", "rf", "boost")) %>% dplyr::select(model, everything()) %>% knitr::kable() ####Error message Error: Problem with `mutate()` input `model`. x Input `model` can't be recycled to size 4. ℹ Input `model` is `c("bag", "rf", "boost")`. ℹ Input `model` must be size 4 or 1, not 3. Run `rlang::last_error()` to see where the error occurred. ##################################################### ##Out-of-sample performance ##################################################### # bagged trees final_fit_bag <- last_fit( wflow_bag, split = split) # random forest final_fit_rf <- last_fit( wflow_rf, split = split) # boosted trees final_fit_boost <- last_fit( wflow_boost, split = split)

数据帧 - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), class = "factor"), Monsoon = structure(c(2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon", "Second_Inter_Monsoon", "South_Monsson"), class = "factor"), Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")

Answer 1

我认为拟合线性模型的误差来自 Month 和 Monsoon 之间的关系。它们完全相关：

library(tidyverse) 

fid_df <- structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
                                  2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
                                  2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
                                  2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
                                                                                                 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
                                                                                                 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
                                                                                                 8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
                                                                                                                                    "April", "May", "June", "July", "August", "September", "October", 
                                                                                                                                    "November", "December"), class = "factor"), Monsoon = structure(c(2L, 
                                                                                                                                                                                                      2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 
                                                                                                                                                                                                      4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 
                                                                                                                                                                                                      3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon", 
                                                                                                                                                                                                                              "Second_Inter_Monsoon", "South_Monsson"), class = "factor"), 
                         Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 
                                       33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37, 
                                       41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31, 
                                                                                       28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30, 
                                                                                       7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26, 
                                                                                       29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")


fid_df %>%
  count(Month, Monsoon)
#>        Month              Monsoon n
#> 1    January        North_Monsoon 3
#> 2   February        North_Monsoon 3
#> 3      March First_Inter_Monssoon 3
#> 4      April First_Inter_Monssoon 3
#> 5        May        South_Monsson 3
#> 6       June        South_Monsson 3
#> 7       July        South_Monsson 3
#> 8     August        South_Monsson 3
#> 9  September        South_Monsson 3
#> 10   October Second_Inter_Monsoon 3
#> 11  November Second_Inter_Monsoon 3
#> 12  December        North_Monsoon 3

如果您在线性模型中使用这样的变量，该模型无法找到两组系数的估计值：

lm(Frequency ~ ., data = fid_df) %>% summary()
#> 
#> Call:
#> lm(formula = Frequency ~ ., data = fid_df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.0008  -3.9357   0.6564   2.9769  12.7681 
#> 
#> Coefficients: (3 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)                 -7286.9226  3443.9292  -2.116   0.0459 *
#> Year                            3.6155     1.7104   2.114   0.0461 *
#> MonthFebruary                  -3.2641     6.6172  -0.493   0.6267  
#> MonthMarch                      0.1006     6.5125   0.015   0.9878  
#> MonthApril                      0.4843     6.5213   0.074   0.9415  
#> MonthMay                       -4.0308    11.0472  -0.365   0.7187  
#> MonthJune                       1.0135    15.5046   0.065   0.9485  
#> MonthJuly                      -2.6910    13.6106  -0.198   0.8451  
#> MonthAugust                    -4.9307     6.6172  -0.745   0.4641  
#> MonthSeptember                 -1.7105     7.1126  -0.240   0.8122  
#> MonthOctober                   -7.6981     6.5685  -1.172   0.2538  
#> MonthNovember                  -8.7484     6.5493  -1.336   0.1953  
#> MonthDecember                  -1.6981     6.5685  -0.259   0.7984  
#> MonsoonNorth_Monsoon                NA         NA      NA       NA  
#> MonsoonSecond_Inter_Monsoon         NA         NA      NA       NA  
#> MonsoonSouth_Monsson                NA         NA      NA       NA  
#> Days                            1.1510     0.4540   2.535   0.0189 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.968 on 22 degrees of freedom
#> Multiple R-squared:  0.8135, Adjusted R-squared:  0.7033 
#> F-statistic: 7.381 on 13 and 22 DF,  p-value: 2.535e-05

^{由 reprex package (v0.3.0.9001)}

于 2020-11-18 创建

既然你有这个信息，我会建议使用一些领域知识来决定是否在模型中使用 Month 或 Monsoon 但不是两者都使用.

Answer 2

根据 Julia Silge 的建议回答

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe

rec <- recipe(Frequency_Blue ~ ., data = Tidy_df) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost", "glm")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

##rmse values for all models

|model |.metric |.estimator |      mean|  n|  std_err|
|:-----|:-------|:----------|---------:|--:|--------:|
|bag   |rmse    |standard   |  8.929936| 10| 1.544587|
|rf    |rmse    |standard   | 10.188710| 10| 1.144354|
|boost |rmse    |standard   |  9.249904| 10| 1.489482|
|glm   |rmse    |standard   | 11.348420| 10| 2.217807|

#####################################################
##Out-of-sample performance
#####################################################
#glm

# bagged trees
final_fit_glm <- last_fit(
                     wflow_glm,
                        split = split)


# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = split)

Tidymodel 包：R 中的一般线性模型 (glm) 和决策树（袋装树、提升树和随机森林）模型

Tidymodel Package: General linear models (glm) and decision tree (bagged trees, boosted trees, and random forest) models in R

regression

r

decision-tree

glm

tidymodels