How to generate a set of dummy variables dependent on values in several other columns with the same prefix in R?

I have a data frame that I am trying to get into a format I can use for my analysis. It looks like this:

ID  Name           Year  K1         K2         ...  K50
1   Contract XYZ   2000  transport  elephants
2   Agreement ABC  2003  pens       music
3   Document 123   2003  elephants
4   Empty Space    2004  music      transport

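Basically, my documents have a unique ID and a year of signature, plus a set of variables named K_1 to K_50 that hold keywords in no particular order. For each keyword, I would like to generate a dummy variable (named transport, pens, ...) that is set when any of the K_1 to K_50 entries contains that particular string.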

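I have tried the code below, which creates a dummy named transport if K_1 or K_2 contains that keyword, but with 50 keywords across 50 columns this is a lot of manual work. Ideally, I would be able to use the strings themselves as variable names and run through all 50 columns to identify every keyword and create the dummies, where a dummy is 1 if the keyword appears in any of the columns for a given ID. However, I would also be happy to create the dummies by hand, as long as I could check all 50 columns without typing them all out.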

library(dplyr)

# Manual attempt: one case_when() per keyword, with every K column listed by hand
document_dummies <- mutate(document,
                  transport = case_when(
                    K_1 == "transport" | K_2 == "transport" ~ 1,
                    TRUE ~ NA_real_
                  ))

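Does this help? It pivots the keyword columns to long format, drops the missing values, and pivots back to wide format with one column per keyword: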

library(tidyverse)

document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ","Agreement ABC",
           "Document 123","Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)
document %>%
  pivot_longer(starts_with("K")) %>%
  select(-name) %>%
  filter(! is.na(value)) %>%
  mutate(has_property = 1) %>%
  pivot_wider(names_from = value, values_from = has_property)
#> # A tibble: 4 x 7
#>      ID Name           Year transport elephants  pens music
#>   <int> <chr>         <int>     <dbl>     <dbl> <dbl> <dbl>
#> 1     1 Contract XYZ   2000         1         1    NA    NA
#> 2     2 Agreement ABC  2003        NA        NA     1     1
#> 3     3 Document 123   2003        NA         1    NA    NA
#> 4     4 Empty Space    2004         1        NA    NA     1

Created on 2021-09-21 by the reprex package (v2.0.1)
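In the output above, a keyword that a document does not have shows up as NA rather than 0. If 0/1 dummies are needed, one possible variation is sketched below. It assumes the same example data; the column names keyword and present are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answer above
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC", "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

document_dummies <- document %>%
  # stack only the columns named K followed by digits
  pivot_longer(matches("^K[0-9]+$"), values_to = "keyword") %>%
  select(-name) %>%
  filter(!is.na(keyword)) %>%
  # a keyword repeated within one document should count only once
  distinct() %>%
  mutate(present = 1L) %>%
  # absent keywords become 0 instead of NA
  pivot_wider(names_from = keyword, values_from = present, values_fill = 0L)

document_dummies
```

Note that a document whose K columns are all NA would be dropped by the filter() step; if such rows exist, they could be restored by joining the result back onto the original ID column.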

We can use the dummy_columns function from the fastDummies package. Using the example data from @danlooo:

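library(tidyverse)

document %>%
        # stack the keyword columns; the backslash must be doubled inside an R string
        pivot_longer(matches('K\\d+'), names_to = NULL) %>%
        filter(!is.na(value)) %>%
        # one 0/1 column per distinct keyword, named value_<keyword>
        fastDummies::dummy_columns('value') %>%
        rename_with(~str_remove(.x, '^value_'), starts_with('value_'))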

# A tibble: 7 x 8
     ID Name           Year value     elephants music  pens transport
  <int> <chr>         <int> <chr>         <int> <int> <int>     <int>
1     1 Contract XYZ   2000 transport         0     0     0         1
2     1 Contract XYZ   2000 elephants         1     0     0         0
3     2 Agreement ABC  2003 pens              0     0     1         0
4     2 Agreement ABC  2003 music             0     1     0         0
5     3 Document 123   2003 elephants         1     0     0         0
6     4 Empty Space    2004 music             0     1     0         0
7     4 Empty Space    2004 transport         0     0     0         1
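
The result above has one row per document and keyword pair (7 rows), whereas the question asks for one row per document. A minimal sketch of one way to collapse it is shown below, assuming fastDummies is installed and that ID, Name and Year identify a document; the name collapsed is hypothetical.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answers
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC", "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

collapsed <- document %>%
  pivot_longer(matches("^K[0-9]+$"), names_to = NULL) %>%
  filter(!is.na(value)) %>%
  fastDummies::dummy_columns("value") %>%
  rename_with(~ sub("^value_", "", .x), starts_with("value_")) %>%
  # the original keyword column is no longer needed once the dummies exist
  select(-value) %>%
  # a document has a keyword if any of its rows has a 1 in that column
  group_by(ID, Name, Year) %>%
  summarise(across(everything(), max), .groups = "drop")

collapsed
```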

Data

document <- data.frame(
        stringsAsFactors = FALSE,
        ID = c(1L, 2L, 3L, 4L),
        Name = c("Contract XYZ","Agreement ABC",
                 "Document 123","Empty Space"),
        Year = c(2000L, 2003L, 2003L, 2004L),
        K1 = c("transport", "pens", "elephants", "music"),
        K2 = c("elephants", "music", NA, NA),
        K50 = c(NA, NA, NA, "transport")
)

> document
  ID          Name Year        K1        K2       K50
1  1  Contract XYZ 2000 transport elephants      <NA>
2  2 Agreement ABC 2003      pens     music      <NA>
3  3  Document 123 2003 elephants      <NA>      <NA>
4  4   Empty Space 2004     music      <NA> transport
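
For comparison, the same dummies could also be built without any packages. The following is a minimal base R sketch under the assumption that the keyword columns are named K1, K2, ... (or K_1, K_2, ... as in the question); the names k_cols, keywords and dummies are illustrative.

```r
# Same example data as above
document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ", "Agreement ABC", "Document 123", "Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)

# Pick every keyword column by its shared prefix (K1 or K_1 style names)
k_cols <- grep("^K_?[0-9]+$", names(document), value = TRUE)

# Collect the distinct keywords that occur anywhere in those columns
keywords <- sort(unique(na.omit(unlist(document[k_cols], use.names = FALSE))))

# For each keyword, flag rows where any keyword column equals it
dummies <- sapply(keywords, function(kw) {
  as.integer(rowSums(document[k_cols] == kw, na.rm = TRUE) > 0)
})

# Append one 0/1 column per keyword, named after the keyword itself
document_dummies <- cbind(document, dummies)
document_dummies
```

This keeps one row per document and yields 0 rather than NA for absent keywords, including for documents that have no keywords at all.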