How to set the splitting rule in a decision_tree spec?

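When I create a spec with decision_tree() from the tidymodels meta-package and fit a decision tree, the default splitting method/rule that the rpart package uses for classification data is the Gini index. In rpart this is controlled by the parms argument of rpart::rpart().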

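A random forest model created with the ranger engine uses the same default for classification data. My question is: how can I change the splitting method to information gain or Shannon entropy?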

Here is an example (see the str() calls and the formas_forest_fit object for the splitting rule):

# install.packages(c("tidymodels", "rpart", "ranger"))
library(tidymodels)

formas <- tibble(
  Color = c("Rojo", "Azul", "Rojo", "Verde", "Rojo", "Verde"), 
  Forma = c("Cuadrado", "Cuadrado", "Redondo", "Cuadrado", "Redondo", "Cuadrado"), 
  `Tamaño` = c("Grande", "Grande", "Pequeño", "Pequeño", "Grande", "Grande"), 
  Compra = structure(c(2L, 2L, 1L, 1L, 2L, 1L), .Label = c("No", "Si"), class = "factor")
)

# Tree spec and fit -----------------------
formas_tree_spec <- 
  decision_tree(min_n = 2) %>% 
  set_mode("classification") %>% 
  set_engine("rpart")

formas_tree_fit <- 
  fit(
    formas_tree_spec, 
    data = formas, 
    formula = Compra ~ .
  )

# Forest spec and fit ----------------------
formas_forest_spec <- 
  rand_forest(trees = 5000, min_n = 2) %>% 
  set_mode("classification") %>% 
  set_engine("ranger") 

formas_forest_fit <- 
  fit(
    formas_forest_spec, 
    data = formas, 
    formula = Compra ~ .
  )

str(rpart::rpart)
str(ranger::ranger)
formas_forest_fit
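
For reference, here is what the two criteria measure on the Compra column above, computed by hand. This is only a base R sketch: the helpers gini(), entropy() and info_gain() are my own names, not functions from rpart or ranger, and the entropy uses log base 2 (bits), which may differ from the scaling an engine uses internally.

```r
# Compra from the example data, in row order
compra <- factor(c("Si", "Si", "No", "No", "Si", "No"), levels = c("No", "Si"))

# Gini impurity: 1 - sum of squared class proportions
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Shannon entropy in bits; empty classes are dropped to avoid 0 * log2(0)
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a binary split described by a logical vector
info_gain <- function(y, left) {
  n <- length(y)
  children <- sum(left) / n * entropy(y[left]) +
    sum(!left) / n * entropy(y[!left])
  entropy(y) - children
}

# Candidate split: Tamaño == "Grande", in row order
is_grande <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)

gini(compra)                 # impurity of the root node
entropy(compra)              # entropy of the root node
info_gain(compra, is_grande) # gain from splitting on Tamaño
```

With three "No" and three "Si" the root node is as impure as it can be under both measures, so the criteria only start to differ once candidate splits are compared.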

Following Emil Hvidfeldt's suggestion: set_engine() accepts additional arguments and passes them directly to the engine function.

Here is the tree spec with the information gain splitting rule:

formas_tree_spec <- 
  decision_tree(min_n = 2) %>% 
  set_mode("classification") %>% 
  set_engine("rpart", parms = list(split = "information"))
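
One way to check that an engine argument is really forwarded is translate(), which prints the call template parsnip will use. The sketch below also touches the ranger half of the question. As far as I can tell from ranger's documentation, splitrule for classification accepts "gini", "extratrees" or "hellinger" and has no entropy option, so "extratrees" is used here only to show the mechanism, not as an information gain equivalent. The object names tree_spec_info and forest_spec_alt are made up for this example.

```r
library(tidymodels)

# rpart engine: the split criterion goes through parms
tree_spec_info <-
  decision_tree(min_n = 2) %>%
  set_mode("classification") %>%
  set_engine("rpart", parms = list(split = "information"))

# ranger engine: the split criterion goes through splitrule
# (assumption: "extratrees" is just a stand-in for a non-default rule)
forest_spec_alt <-
  rand_forest(trees = 5000, min_n = 2) %>%
  set_mode("classification") %>%
  set_engine("ranger", splitrule = "extratrees")

# Print the call templates to see the engine arguments being passed along
translate(tree_spec_info)
translate(forest_spec_alt)
```

Both specs can then be passed to fit() exactly as in the question.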