How to create dummy variables per group of another variable in tidyverse
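I want to create (dummy) variables that indicate whether an observation belongs to a group of observations (identified by a common Group_ID) that has a particular combination of characteristics. The code example should make clearer what I mean.
I tried a combination of group_by and caret::dummyVars, without success. I have run out of ideas; any help would be much appreciated.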
library(tidyverse)
# Input data
# please note: in my case each value of the column Role will appear only once per Group_ID.
input_data <- tribble( ~Group_ID, ~Role, ~Income,
#--|--|----
1, "a", 3.6,
1, "b", 8.5,
2, "a", 7.6,
2, "c", 9.5,
2, "d", 9.7,
3, "a", 1.6,
3, "b", 4.5,
3, "c", 2.7,
3, "e", 7.7,
4, "b", 3.3,
4, "c", 6.2,
)
# desired output
output_data <- tribble( ~Group_ID, ~Role, ~Income, ~Role_A, ~Role_B, ~Role_C, ~Role_D, ~Role_E, ~All_roles,
#--|--|----
1, "a", 3.6, 1, 1, 0, 0, 0, "ab",
1, "b", 8.5, 1, 1, 0, 0, 0, "ab",
2, "a", 7.6, 1, 0, 1, 1, 0, "acd",
2, "c", 9.5, 1, 0, 1, 1, 0, "acd",
2, "d", 9.7, 1, 0, 1, 1, 0, "acd",
3, "a", 1.6, 1, 1, 1, 0, 1, "abce",
3, "b", 4.5, 1, 1, 1, 0, 1, "abce",
3, "c", 2.7, 1, 1, 1, 0, 1, "abce",
3, "e", 7.7, 1, 1, 1, 0, 1, "abce",
4, "b", 3.3, 0, 1, 1, 0, 0, "bc",
4, "c", 6.2, 0, 1, 1, 0, 0, "bc"
)
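Using dplyr and cSplit_e from splitstackshape. For each Group_ID, we paste the Role values together and then use cSplit_e to split them into new binary columns, one per role, based on whether that role is present in the group.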
library(splitstackshape)
library(dplyr)
input_data %>%
group_by(Group_ID) %>%
mutate(new_role = paste(Role, collapse = "")) %>%
ungroup() %>%
cSplit_e("new_role", sep = "", type = "character", fill = 0)
# Group_ID Role Income new_role new_role_a new_role_b new_role_c new_role_d new_role_e
#1 1 a 3.6 ab 1 1 0 0 0
#2 1 b 8.5 ab 1 1 0 0 0
#3 2 a 7.6 acd 1 0 1 1 0
#4 2 c 9.5 acd 1 0 1 1 0
#5 2 d 9.7 acd 1 0 1 1 0
#6 3 a 1.6 abce 1 1 1 0 1
#7 3 b 4.5 abce 1 1 1 0 1
#8 3 c 2.7 abce 1 1 1 0 1
#9 3 e 7.7 abce 1 1 1 0 1
#10 4 b 3.3 bc 0 1 1 0 0
#11 4 c 6.2 bc 0 1 1 0 0
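As a cross-check on the output above, here is a minimal base R sketch of the same "is this role present in the group" idea, built from a Group_ID by Role cross-tabulation instead of cSplit_e. It is not part of the original answer; the intermediate names (tab, flags, row_flags, all_roles, result) are made up for illustration, and the data is re-entered as a plain data.frame so the block runs on its own.

```r
# Same data as in the question, as a plain data.frame (base R only).
input_data <- data.frame(
  Group_ID = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  Role     = c("a", "b", "a", "c", "d", "a", "b", "c", "e", "b", "c"),
  Income   = c(3.6, 8.5, 7.6, 9.5, 9.7, 1.6, 4.5, 2.7, 7.7, 3.3, 6.2),
  stringsAsFactors = FALSE
)

# Count how often each Role occurs in each Group_ID (rows: groups, columns: roles).
tab <- unclass(table(input_data$Group_ID, input_data$Role))

# Turn counts into 0/1 presence flags; this also covers roles repeated within a group.
flags <- 1 * (tab > 0)

# Look up each observation's group row, then rename the columns to Role_A, Role_B, ...
row_flags <- flags[as.character(input_data$Group_ID), , drop = FALSE]
colnames(row_flags) <- paste0("Role_", toupper(colnames(row_flags)))
rownames(row_flags) <- NULL

# All_roles: the roles of each group pasted together, repeated on every row of the group.
all_roles <- ave(input_data$Role, input_data$Group_ID,
                 FUN = function(r) paste(r, collapse = ""))

result <- cbind(input_data, row_flags, All_roles = all_roles)
print(result)
```

Because the flags come from counts rather than from splitting a pasted string, this variant does not depend on every role being a single character.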
The following uses base R modelling functions to create the dummies.
First, create a model matrix without an intercept.
fit <- lm(Group_ID ~ 0 + Role, input_data)
m <- model.matrix(fit)
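Now process that matrix, noting that the dummy variables the question asks for are the per-Group_ID sums of its columns. (This relies on each Role appearing at most once per Group_ID, as stated in the question; otherwise the sums would be counts rather than 0/1 flags.)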
input_data %>%
bind_cols(m %>% as.data.frame()) %>%
group_by(Group_ID) %>%
mutate_at(vars(matches("Role[[:alpha:]]")), sum) %>%
mutate(all_roles = paste(Role, collapse = ""))
## A tibble: 11 x 9
## Groups: Group_ID [4]
# Group_ID Role Income Rolea Roleb Rolec Roled Rolee all_roles
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 1 a 3.6 1 1 0 0 0 ab
# 2 1 b 8.5 1 1 0 0 0 ab
# 3 2 a 7.6 1 0 1 1 0 acd
# 4 2 c 9.5 1 0 1 1 0 acd
# 5 2 d 9.7 1 0 1 1 0 acd
# 6 3 a 1.6 1 1 1 0 1 abce
# 7 3 b 4.5 1 1 1 0 1 abce
# 8 3 c 2.7 1 1 1 0 1 abce
# 9 3 e 7.7 1 1 1 0 1 abce
#10 4 b 3.3 0 1 1 0 0 bc
#11 4 c 6.2 0 1 1 0 0 bc
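The "group sums of the model matrix" step can also be sketched without dplyr, which may make the mechanics easier to see. The version below is an illustration under assumptions, not the answerer's code: it calls model.matrix on a formula directly instead of going through lm, uses base R's rowsum for the per-group totals, and the names group_sums, per_row and result are hypothetical.

```r
# Same data as in the question, as a plain data.frame (base R only).
input_data <- data.frame(
  Group_ID = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  Role     = c("a", "b", "a", "c", "d", "a", "b", "c", "e", "b", "c"),
  Income   = c(3.6, 8.5, 7.6, 9.5, 9.7, 1.6, 4.5, 2.7, 7.7, 3.3, 6.2),
  stringsAsFactors = FALSE
)

# One indicator column per role (Rolea, Roleb, ...); "0 +" drops the intercept.
m <- model.matrix(~ 0 + Role, data = input_data)

# Sum the indicator columns within each Group_ID: one row per group.
group_sums <- rowsum(m, group = input_data$Group_ID)

# Repeat each group's row for every observation in that group.
per_row <- group_sums[as.character(input_data$Group_ID), , drop = FALSE]
rownames(per_row) <- NULL

result <- cbind(input_data, per_row)
print(result)
```

If a role could occur more than once in a group, wrapping the sums as `1 * (group_sums > 0)` would keep the columns as 0/1 flags.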