How to create dummy variables based on two columns in R
Suppose I have a data frame in which Gender can be F (female) or M (male), and Race can be A (Asian), W (White), B (Black), or H (Hispanic):
| id | Gender | Race |
| --- | ----- | ---- |
| 1 | F | W |
| 2 | F | B |
| 3 | M | A |
| 4 | F | B |
| 5 | M | W |
| 6 | M | B |
| 7 | F | H |
I want a set of dummy columns based on both Gender and Race, so the data frame should look like this:
| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | F | W | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | F | B | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | F | B | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | M | W | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | M | B | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | F | H | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
My actual data has many more categories than this example, so I would appreciate a concise way of doing this.
The language is R.
Thanks for your help.
Column names aside, you can use the model.matrix function with a formula that contains only the interaction term and removes the intercept:
> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1 1 F H 0 0 0 0
2 2 M H 0 0 0 0
3 3 M W 0 0 0 0
4 4 F H 0 0 0 0
5 5 M H 0 0 0 0
[etc]
If you care about the exact names, they are easy to tidy up with a little string handling:
> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1 1 F H 0 0 0 0 1 0 0 0
2 2 M H 0 0 0 0 0 1 0 0
3 3 M W 0 0 0 0 0 0 0 1
4 4 F H 0 0 0 0 1 0 0 0
5 5 M H 0 0 0 0 0 1 0 0
6 6 F H 0 0 0 0 1 0 0 0
7 7 F H 0 0 0 0 1 0 0 0
8 8 M A 0 1 0 0 0 0 0 0
9 9 M W 0 0 0 0 0 0 0 1
10 10 F B 0 0 1 0 0 0 0 0
If you care about the column order....
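On the column order, here is a minimal sketch. It assumes the order from the question's expected table (all F columns first, then all M columns, with races as W, B, A, H). It rebuilds the question's example data as `df`, since the answer above uses a different, random `d`. The names `mm` and `wanted` are introduced here purely for illustration.

```r
# Example data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# One indicator column per Gender x Race combination, no intercept
mm <- model.matrix(~ Gender:Race - 1, data = df)

# Turn names like "GenderF:RaceW" into "F_W"
colnames(mm) <- sub("Gender", "", sub("Race", "", sub(":", "_", colnames(mm))))

# Column order assumed from the question's expected table:
# F_W, F_B, F_A, F_H, M_W, M_B, M_A, M_H
wanted <- as.vector(t(outer(c("F", "M"), c("W", "B", "A", "H"), paste, sep = "_")))

# Index the matrix by name to reorder, then bind to the original columns
dm <- cbind(df, mm[, wanted])
```

Because the interaction term is expanded over every pair of levels, this also keeps the combinations that never occur in the data (here `F_A` and `M_H`) as all-zero columns.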
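I think you can use the following solution. It produces two fewer columns than your desired output (F_A and M_H are missing), but those columns would have been all zeros anyway, because pivot_wider only spreads the combinations it actually finds in the data set.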
library(dplyr)
library(tidyr)
df %>%
  mutate(grp = 1) %>%
  pivot_wider(names_from = c(Gender, Race), values_from = grp,
              values_fill = 0, names_glue = "{Gender}_{Race}") %>%
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)
# A tibble: 7 x 9
id Gender Race F_W F_B M_A M_W M_B F_H
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F W 1 0 0 0 0 0
2 2 F B 0 1 0 0 0 0
3 3 M A 0 0 1 0 0 0
4 4 F B 0 1 0 0 0 0
5 5 M W 0 0 0 1 0 0
6 6 M B 0 0 0 0 1 0
7 7 F H 0 0 0 0 0 1
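If the two missing all-zero columns are wanted as well, one possible variation is sketched below. It assumes tidyr 1.2.0 or later, where `pivot_wider()` has a `names_expand` argument that expands the `names_from` columns to every combination of their values before pivoting. The data frame `df` is rebuilt from the question's example and `out` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

out <- df %>%
  mutate(grp = 1) %>%
  pivot_wider(
    names_from   = c(Gender, Race),
    values_from  = grp,
    values_fill  = 0,
    names_glue   = "{Gender}_{Race}",
    names_expand = TRUE  # assumption: requires tidyr >= 1.2.0
  ) %>%
  # Gender and Race were consumed by the pivot, so join them back in
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)
```

With `names_expand = TRUE` the combinations absent from the data (`F_A` and `M_H`) should appear as columns filled with the `values_fill` value.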
In addition to the tidyverse solution from Anoushiravan R, here is another option using unite, pivot_wider, across and case_when:
library(tidyverse)
df %>%
  unite(comb, Gender:Race, remove = FALSE) %>%
  pivot_wider(
    names_from = comb,
    values_from = comb
  ) %>%
  mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H),
                ~ case_when(is.na(.) ~ 0,
                            TRUE ~ 1)))
Output:
id Gender Race F_W F_B M_A M_W M_B F_H
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F W 1 0 0 0 0 0
2 2 F B 0 1 0 0 0 0
3 3 M A 0 0 1 0 0 0
4 4 F B 0 1 0 0 0 0
5 5 M W 0 0 0 1 0 0
6 6 M B 0 0 0 0 1 0
7 7 F H 0 0 0 0 0 1
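Since the question mentions many more categories in the real data, listing every dummy column by name may become tedious. A hedged sketch of the same idea is shown below. It selects "everything except the original columns" instead of naming each dummy, and it swaps `case_when` for a plain `!is.na()` test. `df` is rebuilt from the question's example and `out` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

out <- df %>%
  # Build the combined key, keeping Gender and Race in place
  unite(comb, Gender, Race, remove = FALSE) %>%
  pivot_wider(names_from = comb, values_from = comb) %>%
  # Every column other than the three originals is a dummy column:
  # NA means "combination not present in this row", anything else means present
  mutate(across(-c(id, Gender, Race), ~ as.integer(!is.na(.x))))
```

This produces integer 0/1 columns rather than doubles, and like the version above it only creates columns for combinations that occur in the data.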
A base R option using table:
cbind(df1, as.data.frame.matrix(table(transform(df1,
GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
id Gender Race F_B F_H F_W M_A M_B M_W
1 1 F W 0 0 1 0 0 0
2 2 F B 1 0 0 0 0 0
3 3 M A 0 0 0 1 0 0
4 4 F B 1 0 0 0 0 0
5 5 M W 0 0 0 0 0 1
6 6 M B 0 0 0 0 1 0
7 7 F H 0 1 0 0 0 0
Data
df1 <- structure(list(id = 1:7, Gender = c("F", "F", "M", "F", "M",
"M", "F"), Race = c("W", "B", "A", "B", "W", "B", "H")),
class = "data.frame", row.names = c(NA,
-7L))
Another base R option, using xtabs:
cbind(
  df,
  as.data.frame.matrix(
    xtabs(
      ~ id + q,
      transform(
        df,
        q = paste0(Gender, "_", Race)
      )
    )
  )
)
which gives
id Gender Race F_B F_H F_W M_A M_B M_W
1 1 F W 0 0 1 0 0 0
2 2 F B 1 0 0 0 0 0
3 3 M A 0 0 0 1 0 0
4 4 F B 1 0 0 0 0 0
5 5 M W 0 0 0 0 0 1
6 6 M B 0 0 0 0 1 0
7 7 F H 0 1 0 0 0 0
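Both base R tabulations above only produce columns for combinations that occur in the data. If the all-zero columns from the question's expected output are needed too, one possible sketch is to make the combined key a factor with every combination as a level, since `table` keeps empty levels. This assumes `id` is unique and the rows are already sorted by `id`, because `table` orders its rows by `id`. The names `combos`, `key` and `out` are illustrative only.

```r
# Example data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# Every Gender x Race combination built from the values present in each column
combos <- as.vector(outer(sort(unique(df$Gender)), sort(unique(df$Race)),
                          paste, sep = "_"))

# Combined key as a factor, so unused combinations survive as empty levels
key <- factor(paste(df$Gender, df$Race, sep = "_"), levels = combos)

# Cross-tabulate id against the key: one row per id, one column per combination
# (assumes unique ids and rows sorted by id, as noted above)
out <- cbind(df, as.data.frame.matrix(table(df$id, key)))
```

Here `combos` holds eight names, so `out` gains eight dummy columns, with `F_A` and `M_H` all zero.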