Column name being duplicated in recipe
This is the piece of code I'm having trouble with:
pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
The error:
Error:
! Column name `funder_W.D...I.` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
Backtrace:
1. recipes::prep(pump_recipe)
2. recipes:::prep.recipe(pump_recipe)
4. recipes:::bake.step_dummy(x$steps[[i]], new_data = training)
8. tibble:::as_tibble.data.frame(indicators)
9. tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
...
16. vctrs `<fn>`()
17. vctrs:::validate_unique(names = names, arg = arg)
18. vctrs:::stop_names_must_be_unique(names, arg)
19. vctrs:::stop_names(...)
20. vctrs:::stop_vctrs(class = c(class, "vctrs_error_names"), ...)
Error:
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
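So basically the step_dummy step seems to be doing something weird and creating a duplicated column here. I don't know why this is happening. This is the data I'm working with: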
You have levels in funder and installer that are so similar that step_dummy() creates identical column names for them. The error says that funder_W.D...I. appears twice.
If we do some filtering on the funder column, we see that 3 different names match.
str_subset(data$funder, "W.D") |> unique()
[1] "W.D.&.I." "W.D & I." "W.D &"
Neither "W.D.&.I." nor "W.D & I." is a valid name, so step_dummy() tries to repair them. The repair produces "funder_W.D...I." for both.
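The collision can be reproduced outside the recipe. The sketch below assumes the repair behaves like base R's make.names(), which is consistent with the name in the error message but is not confirmed here as the exact internal call; `levels_seen` is a hypothetical stand-in for the matching values in data$funder.

```r
# Hypothetical stand-in for the three matching values found in data$funder.
levels_seen <- c("W.D.&.I.", "W.D & I.", "W.D &")

# make.names() replaces every character that is invalid in an R name with ".".
# Assumption: step_dummy() repairs level names in a comparable way.
repaired <- make.names(levels_seen)
print(repaired)

# Group the original levels by their repaired name and keep only the groups
# where more than one level ends up with the same name.
collisions <- split(levels_seen, repaired)
collisions <- collisions[lengths(collisions) > 1]
print(collisions)
```

Running the same grouping on unique(data$funder) and unique(data$installer) should list every set of levels that would clash, not only this one.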
You can fix this with textrecipes::step_clean_levels(), which makes sure the levels of these variables stay valid and non-overlapping.
library(recipes)
pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  textrecipes::step_clean_levels(funder, installer) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
Note: as you said, I think "W.D.&.I.", "W.D & I." and "W.D &" all refer to the same entity. You should see if you can collapse these levels manually.
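A minimal sketch of collapsing these levels by hand, before the data reaches the recipe. The `funder` vector and the "Government" value are made up for illustration, and treating the three spellings as one entity is the assumption from the note above; "W.D & I." is an arbitrary choice of canonical spelling.

```r
# Hypothetical example column; in practice this would be data$funder.
funder <- c("W.D.&.I.", "W.D & I.", "W.D &", "Government")

# Spellings assumed to refer to the same entity.
variants <- c("W.D.&.I.", "W.D & I.", "W.D &")

# Replace every variant with one canonical spelling, leave other values as is.
funder_clean <- ifelse(funder %in% variants, "W.D & I.", funder)
print(unique(funder_clean))
```

With only one spelling left, step_dummy() has a single level to name for this entity, so the duplicate cannot come from these three values. The same replacement would need to be applied to installer if it contains the same variants.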