Preprocessing data with R `recipes` package: how to impute by mode in numeric columns (to fit model with xgboost)?

I want to use xgboost for a classification problem, and two of the (several) predictors are binary columns that also happen to have some missing values. Before fitting a model with xgboost, I want to replace those missing values by imputing the mode in each binary column.

My problem is that I want this imputation to be part of a tidymodels `recipe`. That is, not using typical data wrangling procedures such as dplyr/tidyr/data.table, etc. Doing the imputation within a recipe should guard against "information leakage".
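A minimal sketch of that point, with made-up data (`train`/`test` here are hypothetical, not my actual data): prep() estimates the imputation value from the training set only, and bake() then reuses that estimate on new data.

library(tidymodels)

train <- data.frame(x = factor(c("0", "0", "1", NA)),
                    y = factor(c("a", "b", "a", "b")))
test  <- data.frame(x = factor(NA, levels = c("0", "1")),
                    y = factor("a", levels = c("a", "b")))

rec <-
  recipe(y ~ x, data = train) %>%
  step_impute_mode(x) %>%
  prep()                     # the mode ("0") is learned from `train` only

bake(rec, new_data = test)   # the NA in `test` is filled with the training mode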

Although the recipes package provides many step_*() functions designed for data preprocessing, I could not find a way to do the desired imputation by mode on numeric binary columns. There is a function called step_impute_mode(), but it accepts only nominal variables (i.e., of class factor or character). However, I need my binary columns to remain numeric so they can be passed to the xgboost engine.
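A minimal sketch of the constraint (again with made-up data): prep() fails as soon as step_impute_mode() has selected a numeric column.

library(tidymodels)

df <- data.frame(y = factor(c("a", "b", "a", "b")),
                 x = c(0, 1, 1, NA))

recipe(y ~ x, data = df) %>%
  step_impute_mode(x) %>%  # selects a numeric column...
  prep()                   # ...so prep() errors: the step accepts only
                           # nominal (factor/character) variables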

Consider the following toy example. I took it from this reference page and changed the data slightly to reflect the problem.

Create toy data

# install.packages("xgboost")
library(tidymodels)
tidymodels_prefer()

# original data shipped with package
data(two_class_dat)

# simulating 2-column binary data + NAs
n_rows <- nrow(two_class_dat)

df_x1_x2 <-
  data.frame(x1 = rbinom(n_rows, 1, runif(1)),
             x2 = rbinom(n_rows, 1, runif(1)))

## randomly replace 25% of each column with NAs
df_x1_x2[c("x1", "x2")] <-
  lapply(df_x1_x2[c("x1", "x2")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })

# bind original data & simulated data
df_to_xgboost <- cbind(two_class_dat, df_x1_x2)

# split data to training and testing
data_train <- df_to_xgboost[-(1:10), ]
data_test  <- df_to_xgboost[  1:10 , ]
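
Optionally, a quick sanity check that the simulation produced roughly 25% missingness in each binary column:

# proportion of NAs per simulated column; should be close to 0.25
colMeans(is.na(df_to_xgboost[c("x1", "x2")]))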

Set up the model specification and preprocessing recipe with tidymodels tools

# model specification
xgb_spec <- 
  boost_tree(trees = 15) %>% 
  # This model can be used for classification or regression, so set mode
  set_mode("classification") %>% 
  set_engine("xgboost")

# preprocessing recipe
xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_bin2factor(x1, x2) %>% # <-~-~-~-~-~-~-~-~-~-~-~-~-~| these 2 lines are the heart of the problem
  step_impute_mode(x1, x2)    # <-~-~-~-~-~-~-~-~-~-~-~-~-~| I can't impute unless I first convert columns from numeric to factor/chr. 
#                                                          | But once I do, xgboost fails with non-numeric data. 
#                                                          | There isn't `step_*()` for converting back to numeric (like as.numeric())                      


# bind `xgb_spec` and `xgb_recipe` into a workflow object
xgb_wflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>% 
  add_model(xgb_spec)

Fit the model

fit(xgb_wflow, data_train)
#> Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 3124.
#>   'data' accepts either a numeric matrix or a single filename.
#> Timing stopped at: 0 0 0

The fit fails because data_train$x1 and data_train$x2 become factors by way of step_bin2factor(x1, x2). And that is my catch-22: on the one hand, I can't fit an xgboost model unless all the data is numeric; on the other hand, I can't impute by mode unless the data is factor/chr.
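
A near-equivalent worth noting: for strictly 0/1 columns, the median coincides with the mode (except in an exact 50/50 tie), so step_impute_median(), which does accept numeric columns, could sidestep the dilemma. A sketch of that alternative:

# mode-via-median sketch: valid only because x1/x2 contain nothing but 0/1
xgb_recipe_median <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_impute_median(x1, x2)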

While there is a way to build custom step_*() functions, it is a bit complex. So I first wanted to reach out and see whether there is a simple solution I may have missed. My current situation of using xgboost with binary predictors seems pretty mainstream, and I don't want to reinvent the wheel.

Credit goes to user @gus, who answered here:

xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_num2factor(c(x1, x2),
                  transform = function(x) x + 1,
                  levels = c("0", "1")) %>%
  step_impute_mode(x1, x2) %>%
  step_mutate_at(c(x1, x2), fn = ~ as.numeric(.) - 1)
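
The trick: step_num2factor() effectively uses the (transformed) numeric values as 1-based indices into levels, so the 0/1 values are shifted to 1/2 via transform = function(x) x + 1 to map onto levels c("0", "1"); step_mutate_at() then converts the imputed factors back to numeric. A round-trip sketch of the two conversions:

# what step_num2factor() effectively does to x1/x2 here:
x <- c(0, 1, NA)
f <- factor(c("0", "1")[x + 1], levels = c("0", "1"))

# what step_mutate_at() then does: back to numeric 0/1
as.numeric(f) - 1
#> [1]  0  1 NA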

The entire code:

# install.packages("xgboost")
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
tidymodels_prefer()

data(two_class_dat)

n_rows <- nrow(two_class_dat)

df_x1_x2 <-
  data.frame(x1 = rbinom(n_rows, 1, runif(1)),
             x2 = rbinom(n_rows, 1, runif(1)))

df_x1_x2[c("x1", "x2")] <-
  lapply(df_x1_x2[c("x1", "x2")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })

df_to_xgboost <- cbind(two_class_dat, df_x1_x2)
### 
data_train <- df_to_xgboost[-(1:10), ]
data_test  <- df_to_xgboost[  1:10 , ]

xgb_spec <- 
  boost_tree(trees = 15) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_num2factor(c(x1, x2),
                  transform = function(x) x + 1,
                  levels = c("0", "1")) %>%
  step_impute_mode(x1, x2) %>%
  step_mutate_at(c(x1, x2), fn = ~ as.numeric(.) - 1)

xgb_recipe %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 781 x 5
#>        A     B    x1    x2 Class 
#>    <dbl> <dbl> <dbl> <dbl> <fct> 
#>  1 1.44  1.68      1     1 Class1
#>  2 2.34  2.32      1     1 Class2
#>  3 2.65  1.88      0     1 Class2
#>  4 0.849 0.813     1     1 Class1
#>  5 3.25  0.869     1     1 Class1
#>  6 1.05  0.845     0     1 Class1
#>  7 0.886 0.489     1     0 Class1
#>  8 2.91  1.54      1     1 Class1
#>  9 3.14  2.06      1     1 Class2
#> 10 1.04  0.886     1     1 Class2
#> # ... with 771 more rows

xgb_wflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>% 
  add_model(xgb_spec)

fit(xgb_wflow, data_train)
#> [09:35:36] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> -- Preprocessor ----------------------------------------------------------------
#> 3 Recipe Steps
#> 
#> * step_num2factor()
#> * step_impute_mode()
#> * step_mutate_at()
#> 
#> -- Model -----------------------------------------------------------------------
#> ##### xgb.Booster
#> raw: 59.4 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
#>     subsample = 1, objective = "binary:logistic"), data = x$data, 
#>     nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", objective = "binary:logistic", nthread = "1", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 4 
#> niter: 15
#> nfeatures : 4 
#> evaluation_log:
#>     iter training_logloss
#>        1         0.551974
#>        2         0.472546
#> ---                      
#>       14         0.251547
#>       15         0.245090

Created on 2021-12-25 by the reprex package (v2.0.1.9000)
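
For completeness, the trained workflow can then be used on the held-out rows (a sketch assuming the objects above); the recipe, including the imputation values learned from data_train, is applied automatically before prediction:

xgb_fit <- fit(xgb_wflow, data_train)
predict(xgb_fit, new_data = data_test)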