Error: "Missing data in columns" when using tidymodels workflow to predict test set
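I recently started learning to build machine learning workflows with tidymodels. When I use the workflow to predict on the test set, I get a "Missing data in columns" error, even though I am sure that neither the training set nor the test set has any missing data. Here are my code and an example:
# Information about the data: Primary_type in the test set has several novel levels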
str(train_sample)
tibble [500,000 x 3] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:500000] 6590508 2902772 6162081 7777470 7134849 ...
$ Primary_type: Factor w/ 29 levels "ARSON","ASSAULT",..: 16 8 3 3 28 7 3 4 25 15 ...
$ Arrest : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 2 1 1 1 1 ...
str(test_sample)
tibble [300,000 x 3] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:300000] 8876633 9868538 9210518 9279377 8707153 ...
$ Primary_type: Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 31 7 2 8 7 2 31 18 ...
$ Arrest : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 2 1 1 1 2 2 ...
# set the recipe
rec <- recipe(Arrest ~ ., data = train_sample) %>%
  update_role(ID, new_role = "ID") %>%
  step_novel(Primary_type)
# set the model
rf_model <- rand_forest(trees = 10) %>%
  set_engine("ranger", seed = 100, num.threads = 12, verbose = TRUE) %>%
  set_mode("classification")
# set the workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_model)
# fit the train data
wf_fit <- wf %>% fit(train_sample)
# predict the test data
wf_pred <- wf_fit %>% predict(test_sample)
The prediction raises the following error:
ERROR:Missing data in columns: Primary_type.
However, when I run prep() and bake() separately instead of using the workflow, prediction does not raise the error:
# run the workflow steps separately
train_prep <- prep(rec, training = train_sample)
train_bake <- bake(train_prep, new_data = NULL)
test_bake <- bake(train_prep, new_data = test_sample)
# fit the baked train data
rf_model_fit <- rf_model %>% fit(Arrest ~ Primary_type, train_bake)
# predict the baked test data
rf_model_pred <- rf_model_fit %>% predict(test_bake) # No missing data error
I found that the levels of Primary_type are identical in the two baked data sets, which means step_novel() works.
# compare the levels between the baked data sets
identical(levels(train_bake$Primary_type), levels(test_bake$Primary_type))
[1] TRUE
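So why does prediction fail inside the workflow but succeed when the steps are run separately? And how is the missing data generated? Thanks a lot.
I suggest you check out this advice on "Ordering of Steps", especially the section on handling levels in categorical data. You should use step_novel() before other factor handling operations.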
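To make the ordering advice concrete, here is a minimal sketch on toy data. The data frames train_toy and test_toy are hypothetical, and step_dummy() is only an assumed stand-in for "other factor handling operations", since the recipe in the question contains no such step. It shows step_novel() placed first, so that a level unseen during training is mapped to a reserved level before any later step touches the factor.

```r
library(tidymodels)

# Hypothetical toy data: "THEFT" appears only in the test set
train_toy <- data.frame(
  Primary_type = factor(c("ARSON", "ASSAULT", "ARSON", "ASSAULT")),
  Arrest       = factor(c("TRUE", "FALSE", "FALSE", "TRUE"))
)
test_toy <- data.frame(
  Primary_type = factor(c("ARSON", "THEFT")),
  Arrest       = factor(c("FALSE", "TRUE"))
)

rec_toy <- recipe(Arrest ~ ., data = train_toy) %>%
  step_novel(Primary_type) %>%  # first: reserve a level for unseen values
  step_dummy(Primary_type)      # then: any other factor handling (assumed example)

prepped <- prep(rec_toy, training = train_toy)
baked   <- bake(prepped, new_data = test_toy)

# Check whether the unseen level produced any missing values
anyNA(baked)
names(baked)
```

This only exercises the recipe through prep() and bake(); whether it also resolves the error inside a fitted workflow is not something this sketch demonstrates.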