Is there a reason the xgboost code snippet from the usemodels package has one_hot set to TRUE?

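Is there a reason the recipe in the code snippet for the xgboost classifier has one_hot = TRUE? This creates n dummy variables instead of n - 1. I usually set it to FALSE, but I just want to make sure I'm not missing anything.

Code: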

library(dplyr)

# Convert cyl to a factor so the generated recipe treats it as nominal
data <- mtcars %>% 
  as_tibble() %>%  
  mutate(cyl = as.factor(cyl))

usemodels::use_xgboost(mpg ~ cyl, data = data)

Output:

xgboost_recipe <- 
  recipe(formula = mpg ~ cyl, data = data) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(28278)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

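The idea is that, as a tree-based model, xgboost can handle all the levels (unlike a linear model, where a full set of n indicator columns is collinear with the intercept), and it may in fact need more splits to fit well if you don't include all the categories. Read more about this here.

You won't see the same thing for a ranger random forest, because it can handle factors natively.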
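To make the difference concrete, here is a minimal sketch that counts the dummy columns step_dummy() produces for cyl under each setting. The helper name dummy_cols is made up for this illustration, and the snippet assumes the recipes package is installed; it is not part of what usemodels generates.

```r
library(recipes)

# Same setup as the question: cyl becomes a factor with three levels
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Hypothetical helper: prep the recipe and return the predictor column names
dummy_cols <- function(one_hot) {
  rec <- recipe(mpg ~ cyl, data = cars) %>%
    step_dummy(all_nominal(), -all_outcomes(), one_hot = one_hot) %>%
    prep()
  names(bake(rec, new_data = NULL, all_predictors()))
}

n_minus_1 <- dummy_cols(one_hot = FALSE)  # reference level dropped
n_full    <- dummy_cols(one_hot = TRUE)   # one column per level

length(n_minus_1)
length(n_full)
```

With one_hot = FALSE, a tree can only isolate the dropped reference level by splitting on every other indicator in turn, which is the extra-splits cost mentioned above.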

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
cars <- as_tibble(mtcars) %>%  
  mutate(cyl = cyl %>% as.factor)

usemodels::use_ranger(mpg ~ cyl, data = cars)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
#> ranger_recipe <- 
#>   recipe(formula = mpg ~ cyl, data = cars) 
#> 
#> ranger_spec <- 
#>   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
#>   set_mode("regression") %>% 
#>   set_engine("ranger") 
#> 
#> ranger_workflow <- 
#>   workflow() %>% 
#>   add_recipe(ranger_recipe) %>% 
#>   add_model(ranger_spec) 
#> 
#> set.seed(54153)
#> ranger_tune <-
#>   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

Created on 2021-04-07 by the reprex package (v2.0.0)