Is there a reason the xgboost code snippet from the usemodels package has one_hot set to TRUE?
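The recipe in the xgboost code snippet has one_hot = TRUE. Is there a reason for that? This creates "n" dummy variables instead of "n-1". I usually set it to FALSE, but I wanted to make sure I wasn't missing anything.
Code: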
library(dplyr)  # provides %>%, as_tibble() and mutate()

data <- mtcars %>%
  as_tibble() %>%
  mutate(cyl = as.factor(cyl))
usemodels::use_xgboost(mpg ~ cyl, data = data)
Output:
xgboost_recipe <-
  recipe(formula = mpg ~ cyl, data = data) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors())

xgboost_spec <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
    loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgboost_workflow <-
  workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)

set.seed(28278)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
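The idea is that, as a tree-based model, xgboost can handle all the levels (unlike a linear model), and if you don't include all the categories it may actually need more splits to fit well. Read more about this here.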
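To make the "more splits" point concrete, here is a minimal sketch in base R. It uses `model.matrix()` as a stand-in for the two encodings (an assumption for illustration; it is not what the recipe calls internally), and the toy `cyl` vector is made up. With every level present, one column identifies the level "4" directly; with the reference level dropped, "4" is only recognizable as "neither 6 nor 8", which a tree would have to express with two splits.

```r
# Toy factor with three levels (illustrative data, not from the question).
cyl <- factor(c(4, 6, 8, 4, 6, 8))

# One column per level, similar in spirit to one_hot = TRUE.
full <- model.matrix(~ cyl - 1)

# Treatment contrasts with the intercept removed: reference level "4" has no column,
# similar in spirit to one_hot = FALSE.
reduced <- model.matrix(~ cyl)[, -1]

# Full encoding: a single condition on one column isolates level "4".
isolate_full <- unname(full[, "cyl4"] == 1)

# Reduced encoding: level "4" needs conditions on both remaining columns.
isolate_reduced <- unname(reduced[, "cyl6"] == 0 & reduced[, "cyl8"] == 0)

# Both select the same rows; only the number of conditions differs.
all(isolate_full == isolate_reduced)
```

This says nothing about how much the extra split matters in practice; it only shows why the reference level is harder to single out when its column is missing.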
You won't see the same thing for a ranger random forest, because it can handle factors natively.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
cars <- as_tibble(mtcars) %>%
  mutate(cyl = cyl %>% as.factor)
usemodels::use_ranger(mpg ~ cyl, data = cars)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
#> ranger_recipe <-
#>   recipe(formula = mpg ~ cyl, data = cars)
#>
#> ranger_spec <-
#>   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
#>   set_mode("regression") %>%
#>   set_engine("ranger")
#>
#> ranger_workflow <-
#>   workflow() %>%
#>   add_recipe(ranger_recipe) %>%
#>   add_model(ranger_spec)
#>
#> set.seed(54153)
#> ranger_tune <-
#>   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
Created on 2021-04-07 by the reprex package (v2.0.0)
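If you want to check the "n" versus "n-1" difference from the question yourself, a small sketch along these lines should work. It assumes the recipes package is installed; the helper name `dummy_cols` is hypothetical and exists only for this illustration.

```r
library(recipes)

cars <- mtcars
cars$cyl <- factor(cars$cyl)  # three levels: 4, 6, 8

# Hypothetical helper: returns the column names produced by step_dummy()
# for a given one_hot setting.
dummy_cols <- function(one_hot) {
  recipe(mpg ~ cyl, data = cars) %>%
    step_dummy(all_nominal(), -all_outcomes(), one_hot = one_hot) %>%
    prep() %>%
    bake(new_data = NULL) %>%
    names()
}

cols_reduced <- dummy_cols(one_hot = FALSE)  # expect n - 1 indicator columns plus mpg
cols_full    <- dummy_cols(one_hot = TRUE)   # expect n indicator columns plus mpg

cols_reduced
cols_full
```

Comparing the two vectors shows the extra column for the first level of `cyl` when `one_hot = TRUE`.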