How to generate a set of dummy variables dependent on values in several other columns with the same prefix in R?
I have a data frame that I am trying to get into a format I can use for my analysis. It looks like this:
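| ID | Name | Year | K1 | K2 | ... | K50 |
|---|---|---|---|---|---|---|
| 1 | Contract XYZ | 2000 | transport | elephants | | |
| 2 | Agreement ABC | 2003 | pens | music | | |
| 3 | Document 123 | 2003 | elephants | | | |
| 4 | Empty Space | 2004 | music | | | transport |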
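Basically, my documents have a unique ID and a signing year, plus a set of variables named K_1 to K_50 that contain unordered keywords. For each keyword, I want to generate a dummy variable (named transport, pens, ...) that is set if any of the K_1 to K_50 entries contains that particular string.
I have tried the code below, which creates a dummy named transport if K1 or K2 contains the keyword, but with 50 keywords across 50 columns this is a lot of manual work.
Ideally, I would be able to use the strings as variable names and run through all 50 columns to identify every keyword and create the dummies, where a dummy is 1 if the keyword appears in any of the columns for a given ID. However, I would also be happy to create the dummies manually, as long as I can check all 50 columns without typing them all out.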
document_dummies <- mutate(document,
  transport = case_when(
    K_1 == "transport" | K_2 == "transport" ~ 1,
    TRUE ~ NA_real_
  ))
Does this help?
library(tidyverse)

document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC",
           "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

document %>%
  pivot_longer(starts_with("K")) %>%
  select(-name) %>%
  filter(!is.na(value)) %>%
  mutate(has_property = 1) %>%
  pivot_wider(names_from = value, values_from = has_property)
#> # A tibble: 4 x 7
#>      ID Name           Year transport elephants  pens music
#>   <int> <chr>         <int>     <dbl>     <dbl> <dbl> <dbl>
#> 1     1 Contract XYZ   2000         1         1    NA    NA
#> 2     2 Agreement ABC  2003        NA        NA     1     1
#> 3     3 Document 123   2003        NA         1    NA    NA
#> 4     4 Empty Space    2004         1        NA    NA     1
Created on 2021-09-21 by the reprex package (v2.0.1)
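The result above marks absent keywords with NA rather than 0, and a keyword that appears twice for the same ID would make pivot_wider() produce list-columns. A minimal sketch of one way to tighten this, assuming 0/1 dummies are wanted: the pattern `^K_?[0-9]+$` and the names `slot`, `keyword`, and `document_dummies` are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answer above.
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC", "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

document_dummies <- document %>%
  # Match only K1, K2, ... (or K_1, K_2, ...), not every column starting with "K".
  pivot_longer(matches("^K_?[0-9]+$"), names_to = "slot", values_to = "keyword") %>%
  select(-slot) %>%
  filter(!is.na(keyword)) %>%
  # Drop repeated (document, keyword) pairs so each cell gets exactly one value.
  distinct() %>%
  mutate(has_property = 1L) %>%
  # Fill keywords a document does not have with 0 instead of NA.
  pivot_wider(
    names_from = keyword,
    values_from = has_property,
    values_fill = 0L
  )

print(document_dummies)
```

Note that a document whose K columns are all NA is dropped by the filter() step; if such rows matter, they would need to be joined back in afterwards.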
We can use the fastDummies package, which provides the dummy_columns function.
Using the example data from @danlooo:
library(tidyverse)

document %>%
  # The backslash has to be doubled inside an R string literal
  pivot_longer(matches('K\\d+'), names_to = NULL) %>%
  filter(!is.na(value)) %>%
  fastDummies::dummy_columns('value') %>%
  rename_with(~ str_remove(.x, '^value_'), starts_with('value_'))
# A tibble: 7 x 8
     ID Name           Year value     elephants music  pens transport
  <int> <chr>         <int> <chr>         <int> <int> <int>     <int>
1     1 Contract XYZ   2000 transport         0     0     0         1
2     1 Contract XYZ   2000 elephants         1     0     0         0
3     2 Agreement ABC  2003 pens              0     0     1         0
4     2 Agreement ABC  2003 music             0     1     0         0
5     3 Document 123   2003 elephants         1     0     0         0
6     4 Empty Space    2004 music             0     1     0         0
7     4 Empty Space    2004 transport         0     0     0         1
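The output above still has one row per document and keyword. If one row per ID is wanted, as the question asks, the dummy columns can be collapsed by taking the maximum within each document. This is a sketch under that assumption: the grouping by ID, Name, and Year and the names `long`, `slot`, and `per_document` are illustrative, while `remove_selected_columns` is an existing argument of dummy_columns() that drops the source `value` column.

```r
library(dplyr)
library(tidyr)
library(fastDummies)

# Same example data as in the "Data" section.
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC", "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

# Long format: one row per document and non-missing keyword.
long <- document %>%
  pivot_longer(matches("^K_?[0-9]+$"), names_to = "slot", values_to = "value") %>%
  select(-slot) %>%
  filter(!is.na(value))

per_document <- long %>%
  # One 0/1 column per keyword; the original `value` column is dropped.
  dummy_columns("value", remove_selected_columns = TRUE) %>%
  rename_with(~ sub("^value_", "", .x), starts_with("value_")) %>%
  # Collapse to one row per document: a dummy is 1 if any of its rows had 1.
  group_by(ID, Name, Year) %>%
  summarise(across(everything(), max), .groups = "drop")

print(per_document)
```

Taking max() works here because the dummies are 0/1, so the maximum is 1 exactly when the keyword appears at least once for that document.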
Data
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC",
           "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

> document
  ID          Name Year        K1        K2       K50
1  1  Contract XYZ 2000 transport elephants      <NA>
2  2 Agreement ABC 2003      pens     music      <NA>
3  3  Document 123 2003 elephants      <NA>      <NA>
4  4   Empty Space 2004     music      <NA> transport
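For comparison, the same idea can be sketched in base R without reshaping, which stays close to the manual case_when() attempt in the question but loops over every keyword found in the data. This is only an illustration: the names `k_cols`, `keywords`, `dummies`, and `document_dummies` are hypothetical, and the pattern assumes the keyword columns are named K1..K50 or K_1..K_50.

```r
# Same example data as above.
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC", "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

# Columns that hold keywords.
k_cols <- grep("^K_?[0-9]+$", names(document), value = TRUE)

# Every distinct keyword that occurs anywhere in those columns.
vals <- unlist(document[k_cols], use.names = FALSE)
keywords <- sort(unique(vals[!is.na(vals)]))

# For each keyword, flag the rows where any keyword column equals it.
# sapply() names the resulting matrix columns after the keywords.
dummies <- sapply(keywords, function(kw) {
  as.integer(rowSums(document[k_cols] == kw, na.rm = TRUE) > 0)
})

# Keep the non-keyword columns and append the dummies.
document_dummies <- cbind(document[setdiff(names(document), k_cols)], dummies)

print(document_dummies)
```

Unlike the reshaping approaches, this keeps documents that have no keywords at all (they simply get 0 in every dummy column). It assumes the data frame has more than one row, since sapply() would otherwise return a vector instead of a matrix.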