How do I efficiently recode groups of dummies conditional on one dummy?
I am trying to recode several dummy variables at once, but I am struggling to come up with an efficient vectorized solution (or even a for loop).
reprex:
library(tidyverse)
library(magrittr)
library(dummies)
library(janitor)
df_raw <- data.frame(
  species = as.factor(c("cat", "dog", NA, "dog", "dog")),
  weight = rnorm(5, mean = 5, sd = 1),
  sex = as.factor(c("m", NA, "f", "f", "m"))
)
df_raw
  species   weight  sex
1     cat 3.025896    m
2     dog 3.223064 <NA>
3    <NA> 5.230367    f
4     dog 4.231511    f
5     dog 5.819032    m
I split the factor variables (species and sex) into dummy variables, but the NAs get their own indicator columns (species_na and sex_na):
df_dummy <- dummies::dummy.data.frame(df_raw,
                                      dummy.classes = "factor",
                                      sep = "_",
                                      omit.constants = TRUE,
                                      all = TRUE) %>%
  janitor::clean_names()
  species_cat species_dog species_na   weight sex_f sex_m sex_na
1           1           0          0 3.025896     0     1      0
2           0           1          0 3.223064     0     0      1
3           0           0          1 5.230367     1     0      0
4           0           1          0 4.231511     1     0      0
5           0           1          0 5.819032     0     1      0
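My question: how can I recode the dummies in each group conditional on that group's _na dummy? In other words, whenever species_na == 1, I need to set every dummy with the prefix species_ to NA, and likewise for sex_na and the sex_ dummies.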
I came up with the approach below, but I cannot generalize the last step to the whole dataset:
factor_vars <- dplyr::select_if(df_raw, is.factor) %>% colnames()
na_labs <- paste(factor_vars,
                 "na",
                 sep = "_")
df_dummy <- df_dummy %>%
  dplyr::mutate(across(all_of(na_labs),
                       .fns = list(var = ~ . == 1),
                       .names = "{fn}_{col}"))
# --- trial run for one variable only
test <- df_dummy %>%
  mutate(species_cat = ifelse(var_species_na == TRUE,
                              NA,
                              species_cat))
Any help is appreciated!
You can try this:
library(dplyr)
library(purrr)
df_dummy <- dummies::dummy.data.frame(df_raw,
                                      dummy.classes = "factor",
                                      sep = "_",
                                      omit.constants = TRUE,
                                      all = TRUE) %>%
  janitor::clean_names()
factor_vars <- dplyr::select_if(df_raw, is.factor) %>% colnames()
na_labs <- paste(factor_vars,
                 "na",
                 sep = "_")
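# Name the outer argument explicitly: inside a nested ~ lambda, .x would
# refer to the current column rather than to the factor name.
map_dfc(factor_vars, function(v) {
  grp <- select(df_dummy, starts_with(paste0(v, "_")))  # one dummy group
  is_na <- grp[[paste0(v, "_na")]] == 1                 # rows to blank out
  mutate(grp, across(everything(), ~ ifelse(is_na, NA, .x)))
})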
#  species_cat species_dog species_na sex_f sex_m sex_na
#1           1           0          0     0     1      0
#2           0           1          0    NA    NA     NA
#3          NA          NA         NA     1     0      0
#4           0           1          0     1     0      0
#5           0           1          0     0     1      0
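If the other columns (such as weight) should stay in the data frame, a plain loop over the factor names may be enough. The following is a minimal base R sketch rather than the method used above: the helper name recode_dummy_groups and its na_suffix argument are made up for illustration, and df_dummy is typed in by hand (with rounded weights) so that the block runs without the dummies package.

```r
# Hand-typed stand-in for the df_dummy from the question (assumption:
# weights are rounded, everything else matches the printed output).
df_dummy <- data.frame(
  species_cat = c(1L, 0L, 0L, 0L, 0L),
  species_dog = c(0L, 1L, 0L, 1L, 1L),
  species_na  = c(0L, 0L, 1L, 0L, 0L),
  weight      = c(3.03, 3.22, 5.23, 4.23, 5.82),
  sex_f       = c(0L, 0L, 1L, 1L, 0L),
  sex_m       = c(1L, 0L, 0L, 0L, 1L),
  sex_na      = c(0L, 1L, 0L, 0L, 0L)
)
factor_vars <- c("species", "sex")

# Hypothetical helper: for each group prefix, set every column of that
# group to NA in the rows where the group's *_na indicator equals 1.
recode_dummy_groups <- function(df, groups, na_suffix = "_na") {
  for (g in groups) {
    group_cols <- names(df)[startsWith(names(df), paste0(g, "_"))]
    # Compute the row mask before any column of the group is overwritten.
    mask <- df[[paste0(g, na_suffix)]] == 1
    df[mask, group_cols] <- NA
  }
  df
}

df_recoded <- recode_dummy_groups(df_dummy, factor_vars)
df_recoded
```

Because the mask is computed before the assignment, it does not matter that the *_na column itself is part of the group being overwritten.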
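I have a package on GitHub, {dplyover}, which can create dummy variables in an across-like way. Below we select all factor variables with where(is.factor) and apply dist_values to each column. dist_values is a wrapper around unique that returns all non-NA values. The function in .fns receives each selected column as .x and each unique value from dist_values as .y. Because the comparison .y == .x is NA wherever the factor is NA, the resulting dummies are NA in those rows and no separate _na column is created.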
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df_raw %>%
  mutate(crossover(where(is.factor),
                   dist_values,
                   .fns = ~ if_else(.y == .x, 1, 0)))
#>   species   weight  sex species_cat species_dog sex_f sex_m
#> 1     cat 5.281178    m           1           0     0     1
#> 2     dog 4.343656 <NA>           0           1    NA    NA
#> 3    <NA> 4.555380    f          NA          NA     1     0
#> 4     dog 4.990039    f           0           1     1     0
#> 5     dog 4.988497    m           0           1     0     1
Created on 2021-09-13 by the reprex package (v2.0.1)
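The same idea, comparing each factor column against each of its levels so that NA propagates on its own, can also be sketched in base R without {dplyover}. This is an illustrative approximation, not how crossover is implemented: the names factor_cols, dummy_list and df_out are chosen here, the weights are fixed instead of random, and the dummies come out as integers rather than doubles.

```r
df_raw <- data.frame(
  species = factor(c("cat", "dog", NA, "dog", "dog")),
  weight  = c(3.03, 3.22, 5.23, 4.23, 5.82),  # fixed values instead of rnorm()
  sex     = factor(c("m", NA, "f", "f", "m"))
)

# Names of all factor columns.
factor_cols <- names(df_raw)[vapply(df_raw, is.factor, logical(1))]

# For every factor column, build one 0/1 column per level.
# `x == lvl` is NA wherever x is NA, so missing values carry over
# into the dummies and no *_na indicator is produced.
dummy_list <- lapply(factor_cols, function(col) {
  lvls <- levels(df_raw[[col]])
  out <- lapply(lvls, function(lvl) as.integer(df_raw[[col]] == lvl))
  names(out) <- paste(col, lvls, sep = "_")
  as.data.frame(out)
})

df_out <- cbind(df_raw, do.call(cbind, dummy_list))
df_out
```

Starting from df_raw this way avoids creating the _na indicators in the first place, so there is nothing to recode afterwards.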
How about this?
df_dummy <- df_dummy %>%
  mutate(across(c(starts_with("species")), ~ factor(ifelse(species_na == 1, NA, .)))) %>%
  mutate(across(c(starts_with("sex")), ~ factor(ifelse(sex_na == 1, NA, .))))
df_dummy
  species_cat species_dog species_na   weight sex_f sex_m sex_na
1           1           0          0 4.879161     0     1      0
2           0           1          0 5.960176  <NA>  <NA>   <NA>
3        <NA>        <NA>       <NA> 5.189566     1     0      0
4           0           1          0 5.165760     1     0      0
5           0           1          0 5.952365     0     1      0