How to use dplyr to gather multiple instances of an event and create a tidy tibble

I have a dataset that looks like this:

library(tidyverse)

df <- tibble(
  subjid = 1:5,
  event_1 = c("Watery eyes",         # Event number 1 
          "Sore throat",
          "Vomiting",
          "Gastroenteritis viral",
          "Dry Mouth"),
  start_date_1 = as.Date("2017-01-02") + 0:4,
  stop_date_1 = as.Date("2017-01-03") + 0:4,
  severity_1 = 1,
  related_to_drug_1 = 0,
  event_2 = c("Nausea",             # Event number 2
          "Dizziness",
          "Cough",
          "Disorientation",
          "Diarrhea"),
  start_date_2 = as.Date("2017-02-02") + 0:4,
  stop_date_2 = as.Date("2017-02-03") + 0:4,
  severity_2 = 2,
  related_to_drug_2 = 1,
  event_3 = c("Eczema",             # Event number 3
          "Sinusitis",
          "Abdominal discomfort",
          "Muscle spasms",
          "Nasopharyngitis"),
  start_date_3 = as.Date("2017-03-02") + 0:4,
  stop_date_3 = as.Date("2017-03-03") + 0:4,
  severity_3 = 2,
  related_to_drug_3 = 1
)
df

# A tibble: 5 × 16
  subjid               event_1 start_date_1 stop_date_1 severity_1 related_to_drug_1        event_2 start_date_2 stop_date_2 severity_2 related_to_drug_2              event_3
   <int>                 <chr>       <date>      <date>      <dbl>             <dbl>          <chr>       <date>      <date>      <dbl>             <dbl>                <chr>
1      1           Watery eyes   2017-01-02  2017-01-03          1                 0         Nausea   2017-02-02  2017-02-03          2                 1               Eczema
2      2           Sore throat   2017-01-03  2017-01-04          1                 0      Dizziness   2017-02-03  2017-02-04          2                 1            Sinusitis
3      3              Vomiting   2017-01-04  2017-01-05          1                 0          Cough   2017-02-04  2017-02-05          2                 1 Abdominal discomfort
4      4 Gastroenteritis viral   2017-01-05  2017-01-06          1                 0 Disorientation   2017-02-05  2017-02-06          2                 1        Muscle spasms
5      5             Dry Mouth   2017-01-06  2017-01-07          1                 0       Diarrhea   2017-02-06  2017-02-07          2                 1      Nasopharyngitis
# ... with 4 more variables: start_date_3 <date>, stop_date_3 <date>, severity_3 <dbl>, related_to_drug_3 <dbl>

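The real data, however, has many more rows and more than 100 "events", i.e. series of columns. The data frame has one row per subject, with the adverse events and their associated attributes laid out in columns; each column name ends in an underscore and a number that indicates which event it belongs to. I would like to use tidyr to gather these events into a tibble like this: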

# A tibble: 15 × 7
   subjid event_number                 event start_date  stop_date severity related_to_drug
    <int>        <int>                 <chr>     <date>     <date>    <int>           <int>
1       1            1           Watery eyes 2017-01-02 2017-01-03        1               0
2       2            1           Sore throat 2017-01-03 2017-01-04        1               0
3       3            1              Vomiting 2017-01-04 2017-01-05        1               0
4       4            1 Gastroenteritis viral 2017-01-05 2017-01-06        1               0
5       5            1             Dry Mouth 2017-01-06 2017-01-07        1               0
6       1            2                Nausea 2017-02-02 2017-02-03        2               1
7       2            2             Dizziness 2017-02-03 2017-02-04        2               1
8       3            2                 Cough 2017-02-04 2017-02-05        2               1
9       4            2        Disorientation 2017-02-05 2017-02-06        2               1
10      5            2              Diarrhea 2017-02-06 2017-02-07        2               1
11      1            3                Eczema 2017-03-02 2017-03-03        2               1
12      2            3             Sinusitis 2017-03-03 2017-03-04        2               1
13      3            3  Abdominal discomfort 2017-03-04 2017-03-05        2               1
14      4            3         Muscle spasms 2017-03-05 2017-03-06        2               1
15      5            3       Nasopharyngitis 2017-03-06 2017-03-07        2               1

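That is, each adverse event gets its own row, and the columns describe the attributes of that particular event.

You can do this with the following code: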

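df %>%
  # stack every column except the first (subjid) into key/value pairs
  gather(Var, Val, -1) %>%
  # mark the boundary between attribute name and event number with "!!"
  mutate(Var = gsub('_(\\d+)', '!!\\1', Var)) %>% 
  separate(Var, c('Var', 'Event'), sep = '!!') %>%
  # one column per attribute again
  spread(Var, Val)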

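Unfortunately, this destroys the classes of your columns, because gather puts values of different types into a single column. The classes will need fixing afterwards, which you can do with a call to mutate.

(Also note that the mutate line after gather is only there because your column names themselves contain "_", and I wanted to split off just the event number.)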
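The class repair mentioned above could look roughly like the sketch below. It is not part of the original answer. It assumes dplyr 1.0.0 or later (for across()) and uses a hypothetical, cut-down stand-in called df_small instead of the full df. Converting every value column to character before gathering is a choice made here so that dates are kept as "YYYY-MM-DD" strings and can be parsed back reliably.

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down version of the question's data (2 subjects, 2 events).
df_small <- tibble(
  subjid = 1:2,
  event_1 = c("Watery eyes", "Sore throat"),
  start_date_1 = as.Date("2017-01-02") + 0:1,
  severity_1 = 1,
  event_2 = c("Nausea", "Dizziness"),
  start_date_2 = as.Date("2017-02-02") + 0:1,
  severity_2 = 2
)

fixed <- df_small %>%
  # Make every value column character up front, so gather() has nothing to coerce.
  mutate(across(-subjid, as.character)) %>%
  gather(Var, Val, -subjid) %>%
  mutate(Var = gsub("_(\\d+)$", "!!\\1", Var)) %>%
  separate(Var, c("Var", "Event"), sep = "!!") %>%
  spread(Var, Val) %>%
  # Restore the classes by hand, one column at a time.
  mutate(
    Event = as.integer(Event),
    start_date = as.Date(start_date),
    severity = as.numeric(severity)
  )

fixed
```

With more than 100 events the mutate call stays the same size, since after spread there is only one column per attribute.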

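Here is a somewhat more complicated approach which, importantly, preserves the classes.
Start from the column names, split them by event number, build one data frame per event, and finally stack the pieces vertically: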

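names(df) %>% 
  setdiff("subjid") %>% 
  # group the column names by their trailing event number
  split(sub(".*_(\\d+)$", "\\1", x = .)) %>% 
  # one data frame per event, always keeping subjid
  map(~ select(df, subjid, all_of(.x))) %>% 
  # strip the "_<number>" suffix so all pieces share the same column names
  map(~ setNames(.x, nm = sub("(.*)_\\d+$", "\\1", x = names(.x)))) %>%
  map2(names(.), ~ mutate(.x, event_number = .y)) %>% 
  bind_rows() %>% 
  select(subjid, event_number, everything())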
# # A tibble: 15 × 7
#    subjid event_number                 event start_date  stop_date severity related_to_drug
#     <int>        <chr>                 <chr>     <date>     <date>    <dbl>           <dbl>
# 1       1            1           Watery eyes 2017-01-02 2017-01-03        1               0
# 2       2            1           Sore throat 2017-01-03 2017-01-04        1               0
# 3       3            1              Vomiting 2017-01-04 2017-01-05        1               0
# 4       4            1 Gastroenteritis viral 2017-01-05 2017-01-06        1               0
# 5       5            1             Dry Mouth 2017-01-06 2017-01-07        1               0
# 6       1            2                Nausea 2017-02-02 2017-02-03        2               1
# 7       2            2             Dizziness 2017-02-03 2017-02-04        2               1
# 8       3            2                 Cough 2017-02-04 2017-02-05        2               1
# 9       4            2        Disorientation 2017-02-05 2017-02-06        2               1
# 10      5            2              Diarrhea 2017-02-06 2017-02-07        2               1
# 11      1            3                Eczema 2017-03-02 2017-03-03        2               1
# 12      2            3             Sinusitis 2017-03-03 2017-03-04        2               1
# 13      3            3  Abdominal discomfort 2017-03-04 2017-03-05        2               1
# 14      4            3         Muscle spasms 2017-03-05 2017-03-06        2               1
# 15      5            3       Nasopharyngitis 2017-03-06 2017-03-07        2               1
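For readers on a newer tidyr, the same reshaping can probably be done in a single step with pivot_longer(). This is a different technique from the two shown above, not the original answer's method. The sketch assumes tidyr 1.1.0 or later (for names_transform) and again uses a hypothetical, cut-down df_small. The special name ".value" tells pivot_longer() to turn the first regex group into output column names, so each attribute keeps its own class.

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down version of the question's data (2 subjects, 2 events).
df_small <- tibble(
  subjid = 1:2,
  event_1 = c("Watery eyes", "Sore throat"),
  start_date_1 = as.Date("2017-01-02") + 0:1,
  severity_1 = 1,
  event_2 = c("Nausea", "Dizziness"),
  start_date_2 = as.Date("2017-02-02") + 0:1,
  severity_2 = 2
)

long <- df_small %>%
  pivot_longer(
    cols = -subjid,
    # first group -> column name (".value"), second group -> event_number
    names_to = c(".value", "event_number"),
    names_pattern = "^(.*)_(\\d+)$",
    # store the event number as an integer rather than a string
    names_transform = list(event_number = as.integer)
  ) %>%
  # match the ordering of the desired output: by event, then by subject
  arrange(event_number, subjid)

long
```

Because event_number is converted to an integer, sorting stays numeric even with more than 100 events, where character sorting would put "10" before "2".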