Problem Fitting Model in Tidymodels: Error: ! Can't use NA as column index with `[` at positions 3 and 4

Question

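I am working on a project in Tidymodels. I created a training set and a test set, and defined a recipe and a model. When I call workflow(), add the recipe and the model, and then call fit(data = df_train), I get the following error.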

Error:
! Can't use NA as column index with `[` at positions 3 and 4.

I am using R version 4.1.3 and RStudio 2022.02.0 Build 443.

For reproducibility, here is the full workflow. Note that the data is hosted on GitHub, so you need an internet connection to load it.

## Load package manager

if(!require(pacman)){
  
  install.packages("pacman")
  
}

## Load required packages. Download them if they do not exist in my system.

pacman::p_load(tidyverse, kableExtra, skimr, knitr, glue, GGally, 
               
               corrplot, tidymodels, themis, stargazer, rpart, rpart.plot, 
               
               vip, patchwork, data.table)

The next step loads the data.

df <- fread('https://raw.githubusercontent.com/Karuitha/data_projects/master/employee_turnover/data/employee_churn_data.csv') %>%

  mutate(left = factor(left, levels = c("yes", "no")))

Next, I split the data into training and testing sets and create a recipe.

## Create a split object consisting 75% of data
split_object <- initial_split(df, prop = 0.75, 
                              
                              strata = left)

## Generate the training set
df_train <- split_object %>%
  
  training()

## Generate the testing set
df_test <- split_object %>%
  
  testing()

###############################################
## Create a recipe
df_recipe <- recipes::recipe(left ~ ., 
                             
                             data = df_train) %>%
  
  ##We upsample the data to balance the outcome variable
  themis::step_upsample(left, 
                        
                        over_ratio = 1, 
                        
                        seed = 500) %>%
  
  ##We make all character variables factors
  step_string2factor(all_nominal_predictors()) %>%
  
  ##We remove one in a pair of highly correlated variables
  ## The threshold for removal is 0.85 (absolute) 
  ## The choice of threshold is subjective. 
  step_corr(all_numeric_predictors(), 
            
            threshold = 0.85) %>%
  
  ## Train these steps on the training data
  prep(training = df_train)

Next, I define a model and try to fit it.

## Define a logistic model
logistic_model <- logistic_reg() %>%
  
  set_engine("glm") %>%
  
  set_mode("classification")

Then I fit it.

workflow() %>% 
  
  add_recipe(df_recipe) %>% 
  
  add_model(logistic_model) %>% 
  
  fit(data = df_train)

This is where I get the error:

Error:
! Can't use NA as column index with `[` at positions 3 and 4.

I have checked and rechecked my code. Any help is welcome.

Answer 1

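I am answering my own question.

One thing I realized is that the problem lies in the recipe step. When I replaced step_string2factor with step_dummy, everything worked.

I still do not know why this is the case. Maybe I need to study Tidymodels more closely!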
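For reference, here is a minimal sketch of what the two steps produce. It uses made-up toy data (the `toy` data frame and its columns are hypothetical, not the employee data), and it only illustrates the difference in output; it is not meant as an explanation of the error.

```r
library(recipes)

# Toy data (hypothetical, for illustration only)
toy <- data.frame(
  dept  = c("sales", "admin", "sales", "it"),
  score = c(0.5, 0.7, 0.6, 0.9),
  left  = factor(c("yes", "no", "no", "yes"), levels = c("yes", "no")),
  stringsAsFactors = FALSE
)

# step_string2factor(): the character column becomes a single factor column
as_factor <- recipe(left ~ ., data = toy) %>%
  step_string2factor(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# Adding step_dummy(): the factor is expanded into 0/1 indicator columns
as_dummy <- recipe(left ~ ., data = toy) %>%
  step_string2factor(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

names(as_factor)
names(as_dummy)
```

In the first result `dept` is still one column; in the second it has been replaced by indicator columns named after its levels.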

Answer 2

In df_recipe, remove prep(training = df_train). That is, define the recipe like this:

## Create a recipe
df_recipe <- recipes::recipe(left ~ .,
                             data = df_train) %>%
  ##We upsample the data to balance the outcome variable
  themis::step_upsample(left,
                        over_ratio = 1,
                        seed = 500) %>%
  ##We make all character variables factors
  step_string2factor(all_nominal_predictors()) %>%
  ##We remove one in a pair of highly correlated variables
  ## The threshold for removal is 0.85 (absolute)
  ## The choice of threshold is subjective.
  step_corr(all_numeric_predictors(),
            threshold = 0.85)

With prep() removed, fit() ran without the error. I believe prep() is not needed here because fitting a workflow() estimates the recipe for you.

From the help page ?workflow:

When you specify and fit a model with a workflow(), parsnip and workflows match and reproduce the underlying behavior of the user-specified model’s computational engine.

From ?prep():

 If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually estimating a recipe (see the example in recipe()).
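A minimal sketch of the pattern that quote recommends, assuming a small simulated data set in place of the employee data (the `toy` tibble, its columns, and the seed are made up for illustration). The recipe is passed to the workflow without calling prep(), and `recipes::fully_trained()` is used only to show that the recipe object itself stays untrained.

```r
library(tidymodels)

set.seed(123)

# Toy data (hypothetical stand-in for the employee churn data)
toy <- tibble(
  satisfaction = runif(200),
  tenure       = sample(2:10, 200, replace = TRUE),
  salary       = sample(c("low", "medium", "high"), 200, replace = TRUE)
) %>%
  mutate(
    left = factor(
      ifelse(satisfaction + rnorm(200, sd = 0.3) < 0.5, "yes", "no"),
      levels = c("yes", "no")
    )
  )

# Define the recipe but do NOT call prep(); it stays an untrained specification
toy_recipe <- recipe(left ~ ., data = toy) %>%
  step_string2factor(all_nominal_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# fit() estimates the recipe steps and then fits the model in one call
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = toy)

# The recipe object we defined is still untrained
recipes::fully_trained(toy_recipe)

# predict() applies the estimated preprocessing to new data before predicting
toy_pred <- predict(toy_fit, new_data = head(toy))
toy_pred
```

If you want to check whether a recipe you already have was prepped by accident, calling `recipes::fully_trained()` on it before `add_recipe()` is one way to find out.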

So when should you use prep()? In my experience, it is helpful when you want to see what the data look like after preprocessing:

# How the training data will look after your recipe steps are applied
df_recipe %>% 
  prep() %>% 
  bake(new_data = NULL)
# A tibble: 10,134 x 9
   department  promoted review projects salary tenure satisfaction bonus left 
   <fct>          <int>  <dbl>    <int> <fct>   <dbl>        <dbl> <int> <fct>
 1 operations         0  0.578        3 low         5        0.627     0 no   
 2 sales              0  0.676        3 high        5        0.578     1 no   
 3 admin              0  0.620        4 high        5        0.687     0 no   
 4 sales              0  0.653        4 low         6        0.679     0 no   
 5 sales              0  0.642        3 medium      6        0.623     0 no   
 6 support            0  0.563        4 medium      5        0.559     0 no   
 7 engineering        0  0.799        3 medium      5        0.433     1 no   
 8 marketing          0  0.611        3 low         6        0.502     0 no   
 9 sales              0  0.567        3 medium      6        0.845     0 no   
10 finance            0  0.583        3 medium      6        0.608     0 no   
# ... with 10,124 more rows
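The same prep() and bake() pair can also be used to inspect how new data would be processed. The sketch below is an illustration under assumptions: the toy data frames are made up, and step_normalize() is not part of the recipe above; it is swapped in only because it makes the effect easy to see.

```r
library(recipes)

# Toy training and testing sets (hypothetical, for illustration only)
toy_train <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 8.1))
toy_test  <- data.frame(x = c(10, 20),     y = c(20.3, 39.8))

# prep() estimates the centering and scaling values from the training set
trained <- recipe(y ~ x, data = toy_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# new_data = NULL returns the processed training data
baked_train <- bake(trained, new_data = NULL)
baked_train

# Passing new data reuses the training-set mean and sd; nothing is re-estimated
baked_test <- bake(trained, new_data = toy_test)
baked_test
```

The test rows are scaled with the training mean and standard deviation, which is why their values fall far outside the range seen in the processed training data.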

Tags: r, machine-learning, tidyverse, tidymodels