Problem Fitting Model in Tidymodels: Error: ! Can't use NA as column index with `[` at positions 3 and 4
I am writing a project with Tidymodels. I created a training set and a test set, and set up a recipe and a model. When I call workflow(), add the recipe and the model, and then call fit(data = df_train), I get the following error.
Error:
! Can't use NA as column index with `[` at positions 3 and 4.
I am using R version 4.1.3 and RStudio 2022.02.0 Build 443.
For reproducibility, here is the workflow. Note that the data is on GitHub, so you will need an internet connection to load it.
## Load package manager
if (!require(pacman)) {
  install.packages("pacman")
}
## Load required packages. Download them if they do not exist in my system.
pacman::p_load(tidyverse, kableExtra, skimr, knitr, glue, GGally,
               corrplot, tidymodels, themis, stargazer, rpart, rpart.plot,
               vip, patchwork, data.table)
The next step loads the data.
df <- fread('https://raw.githubusercontent.com/Karuitha/data_projects/master/employee_turnover/data/employee_churn_data.csv') %>%
  mutate(left = factor(left, levels = c("yes", "no")))
Next, I split the data into training and testing sets and create a recipe.
## Create a split object consisting 75% of data
split_object <- initial_split(df, prop = 0.75,
                              strata = left)
## Generate the training set
df_train <- split_object %>%
  training()
## Generate the testing set
df_test <- split_object %>%
  testing()
###############################################
## Create a recipe
df_recipe <- recipes::recipe(left ~ .,
                             data = df_train) %>%
  ## We upsample the data to balance the outcome variable
  themis::step_upsample(left,
                        over_ratio = 1,
                        seed = 500) %>%
  ## We make all character variables factors
  step_string2factor(all_nominal_predictors()) %>%
  ## We remove one in a pair of highly correlated variables
  ## The threshold for removal is 0.85 (absolute)
  ## The choice of threshold is subjective.
  step_corr(all_numeric_predictors(),
            threshold = 0.85) %>%
  ## Train these steps on the training data
  prep(training = df_train)
Next, I define a model and try to fit it.
## Define a logistic model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
Then I fit it.
workflow() %>%
  add_recipe(df_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = df_train)
This is where I get the error:
Error:
! Can't use NA as column index with `[` at positions 3 and 4.
I have checked and rechecked. Any help is welcome.
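I am answering my own question.
One thing I realised is that the problem is in the recipe step. When I replace step_string2factor with step_dummy, everything works.
I still do not know why this happens. Maybe I need to study Tidymodels more closely!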
In df_recipe, remove prep(training = df_train). That is, define it like this:
## Create a recipe
df_recipe <- recipes::recipe(left ~ .,
                             data = df_train) %>%
  ## We upsample the data to balance the outcome variable
  themis::step_upsample(left,
                        over_ratio = 1,
                        seed = 500) %>%
  ## We make all character variables factors
  step_string2factor(all_nominal_predictors()) %>%
  ## We remove one in a pair of highly correlated variables
  ## The threshold for removal is 0.85 (absolute)
  ## The choice of threshold is subjective.
  step_corr(all_numeric_predictors(),
            threshold = 0.85)
When I run fit() after removing prep(), no error occurs. I believe prep() is not needed here because workflow() does the same thing.
From the help page ?workflow:
When you specify and fit a model with a workflow(), parsnip and workflows match and reproduce the underlying behavior of the user-specified model’s computational engine.
From ?prep():
If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually estimating a recipe (see the example in recipe()).
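To illustrate that recommendation, here is a minimal sketch in which an untrained recipe is handed to a workflow and fit() estimates the steps itself. It assumes the two_class_dat data set from the modeldata package rather than the data from the question, and the names toy_recipe, toy_model, toy_fit and toy_preds are made up for this example.

```r
library(tidymodels)

## Example data shipped with modeldata (an assumption; not the question's data)
data(two_class_dat, package = "modeldata")

## Specify the recipe but do NOT call prep() on it
toy_recipe <- recipe(Class ~ ., data = two_class_dat) %>%
  step_normalize(all_numeric_predictors())

toy_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

## fit() estimates the recipe steps on the data and then fits the model
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(toy_model) %>%
  fit(data = two_class_dat)

## New data passes through the same preprocessing before prediction
toy_preds <- predict(toy_fit, new_data = head(two_class_dat))
```

The only difference from the failing code in the question is that the recipe reaches add_recipe() without having been prepped.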
So when should you use prep()? In my experience, it is helpful when you want to see what the data looks like after preprocessing:
## What the training data will look like after the recipe steps are applied
df_recipe %>%
  prep() %>%
  bake(new_data = NULL)
# A tibble: 10,134 x 9
   department  promoted review projects salary tenure satisfaction bonus left
   <fct>          <int>  <dbl>    <int> <fct>   <dbl>        <dbl> <int> <fct>
 1 operations         0  0.578        3 low         5        0.627     0 no
 2 sales              0  0.676        3 high        5        0.578     1 no
 3 admin              0  0.620        4 high        5        0.687     0 no
 4 sales              0  0.653        4 low         6        0.679     0 no
 5 sales              0  0.642        3 medium      6        0.623     0 no
 6 support            0  0.563        4 medium      5        0.559     0 no
 7 engineering        0  0.799        3 medium      5        0.433     1 no
 8 marketing          0  0.611        3 low         6        0.502     0 no
 9 sales              0  0.567        3 medium      6        0.845     0 no
10 finance            0  0.583        3 medium      6        0.608     0 no
# ... with 10,124 more rows
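The same inspection can be done on held-out data by passing it as new_data. The sketch below is only an illustration under assumptions: it uses two_class_dat from the modeldata package instead of the data above, and toy_split, toy_prepped, baked_train and baked_test are hypothetical names.

```r
library(tidymodels)

## Example data shipped with modeldata (an assumption; not the question's data)
data(two_class_dat, package = "modeldata")

set.seed(123)
toy_split <- initial_split(two_class_dat, prop = 0.75, strata = Class)

## Estimate the recipe steps on the training portion only
toy_prepped <- recipe(Class ~ ., data = training(toy_split)) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

## new_data = NULL returns the processed training data
baked_train <- bake(toy_prepped, new_data = NULL)

## Passing the test set applies the already estimated steps to it
baked_test <- bake(toy_prepped, new_data = testing(toy_split))
```

Note that themis::step_upsample() has skip = TRUE by default, so in the recipe above the upsampling would show up in bake(new_data = NULL) but would not be applied when baking a test set.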