Create presence/absence variables from character string for long data

Suppose I have a data frame like this:

dat<- data.frame(ID= rep(c("A","B","C","D"),4),
             test= rep(c("pre","post"),8),
             item= c(rep("item1",8),rep("item2",8)),
             answer= c("undergraduateeducation_graduateorprofessionalschool_employment", 
                       "graduateorprofessionalschool",
                       "undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
                       "volunteeractivityoroutreach", 
                       "undergraduateeducation_employment_volunteeractivityoroutreach",
                       "employment",
                       "volunteeractivityoroutreach",
                       "undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
                       "undergraduateeducation_graduateorprofessionalschool_employment", 
                       "graduateorprofessionalschool",
                       "undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
                       "volunteeractivityoroutreach", 
                       "undergraduateeducation_employment_volunteeractivityoroutreach",
                       "employment",
                       "volunteeractivityoroutreach",
                       "undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach"))

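The answer column holds responses to a "select all that apply" question, with underscores separating the selected answer options. For each combination of ID, test, and item, I want to turn this single variable into several presence/absence variables indicating whether each answer component occurs in the string: 1 means the option is present in the respondent's answer, and 0 means it is absent. The variables undergraduate, graduate, employment, and volunteer in res correspond, respectively, to the following strings in answer: undergraduateeducation, graduateorprofessionalschool, employment, and volunteeractivityoroutreach. Spaces have been removed from the strings.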

The resulting data frame should look like this:

res<- data.frame(ID= rep(c("A","B","C","D"),4),
             test= rep(c("pre","post"),8),
             item= c(rep("item1",8),rep("item2",8)),
             undergraduate= c(1,0,1,0,1,0,0,1,1,0,1,0,1,0,0,1),
             graduate= c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1),
             employment=c(1,0,1,0,1,1,0,1,1,0,1,0,1,1,0,1),
             volunteer=c(0,0,1,1,1,0,1,1,0,0,1,1,1,0,1,1))
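To make the intended encoding concrete, here is a minimal base R sketch for a single answer string. The lookup vector answer_options and the names one_answer, selected, and flags are introduced only for this illustration; the sketch assumes each option appears in the string exactly as spelled in the lookup.

```r
# Hypothetical lookup: new column name -> full option string used in `answer`
answer_options <- c(undergraduate = "undergraduateeducation",
                    graduate      = "graduateorprofessionalschool",
                    employment    = "employment",
                    volunteer     = "volunteeractivityoroutreach")

# One answer string taken from the example data (row 5)
one_answer <- "undergraduateeducation_employment_volunteeractivityoroutreach"

# Split on the underscore to get the selected options
selected <- strsplit(one_answer, "_", fixed = TRUE)[[1]]

# 1 if the option was selected, 0 otherwise
flags <- as.numeric(answer_options %in% selected)
names(flags) <- names(answer_options)
flags
# undergraduate = 1, graduate = 0, employment = 1, volunteer = 1
```

These values match row 5 of res.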

In base R you could do:

new_cols <- c('undergraduate', 'graduate', 'employment', 'volunteer')

cbind(dat[1:3],
      as.data.frame(do.call(rbind, lapply(strsplit(dat$answer, "_"), 
      function(x) {
        # "\\b" is a regex word boundary, so "graduate" does not match inside "undergraduate..."
        z <- sapply(new_cols, function(y) as.numeric(grepl(paste0("\\b", y), x)))
        if(is.vector(z)) z else colSums(z)
      }))))
#>    ID test  item undergraduate graduate employment volunteer
#> 1   A  pre item1             1        1          1         0
#> 2   B post item1             0        1          0         0
#> 3   C  pre item1             1        1          1         1
#> 4   D post item1             0        0          0         1
#> 5   A  pre item1             1        0          1         1
#> 6   B post item1             0        0          1         0
#> 7   C  pre item1             0        0          0         1
#> 8   D post item1             1        1          1         1
#> 9   A  pre item2             1        1          1         0
#> 10  B post item2             0        1          0         0
#> 11  C  pre item2             1        1          1         1
#> 12  D post item2             0        0          0         1
#> 13  A  pre item2             1        0          1         1
#> 14  B post item2             0        0          1         0
#> 15  C  pre item2             0        0          0         1
#> 16  D post item2             1        1          1         1

Created on 2022-05-05 by the reprex package (v2.0.1)
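The word boundary in the pattern matters because "graduate" is a substring of "undergraduateeducation". A small sketch of the difference, using one split answer from the example data (the names parts, loose, strict, and strict_hit are only for illustration):

```r
# Components of one answer string that does NOT include the graduate option
parts <- strsplit("undergraduateeducation_employment", "_")[[1]]

# Plain substring match: "graduate" is found inside "undergraduateeducation"
loose <- grepl("graduate", parts)
loose
#> [1]  TRUE FALSE

# With a leading word boundary, only components starting with "graduate" match
strict <- grepl("\\bgraduate", parts)
strict
#> [1] FALSE FALSE

# The real graduate option is still detected
strict_hit <- grepl("\\bgraduate", "graduateorprofessionalschool")
strict_hit
#> [1] TRUE
```

Without the boundary, every respondent who selected the undergraduate option would also be flagged as having selected the graduate one.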

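One option is to use the tidyverse: split the data into rows on the underscore, then keep only the keyword from each piece (these become the column names). Next, add a value column that records presence, pivot to wide format, and fill the remaining cells with 0.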

library(tidyverse)

result <- dat %>%
  mutate(rn = row_number()) %>% 
  separate_rows(answer, sep = "_") %>%
  mutate(answer = str_extract(answer, "undergraduate|graduate|employment|volunteer"),
         value = 1) %>% 
  pivot_wider(names_from = "answer", values_from = "value", values_fill = 0) %>% 
  select(-rn)

Output

   ID    test  item  undergraduate graduate employment volunteer
   <chr> <chr> <chr>         <dbl>    <dbl>      <dbl>     <dbl>
 1 A     pre   item1             1        1          1         0
 2 B     post  item1             0        1          0         0
 3 C     pre   item1             1        1          1         1
 4 D     post  item1             0        0          0         1
 5 A     pre   item1             1        0          1         1
 6 B     post  item1             0        0          1         0
 7 C     pre   item1             0        0          0         1
 8 D     post  item1             1        1          1         1
 9 A     pre   item2             1        1          1         0
10 B     post  item2             0        1          0         0
11 C     pre   item2             1        1          1         1
12 D     post  item2             0        0          0         1
13 A     pre   item2             1        0          1         1
14 B     post  item2             0        0          1         0
15 C     pre   item2             0        0          0         1
16 D     post  item2             1        1          1         1

Test

identical(result, as_tibble(res))

#[1] TRUE