pivot_wider: Combine Duplicate Observations and Create New Variable Columns for Those Values

I'm new to R and have searched the site for a solution. I found plenty of similar but slightly different questions, and I'm stumped.

I have a dataset with this structure:

  SURVEY_ID    CHILD_NAME    CHILD_AGE
  Survey1      Billy             4
  Survey2      Claude            12
  Survey2      Maude             6
  Survey2      Constance         3
  Survey3      George            22
  Survey4      Marjoram          14
  Survey4      LeBron            37
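
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample above as a data frame. The object name df and the integer type for CHILD_AGE are assumptions made for illustration:

```r
# Sample data from the table above; df is an assumed object name
df <- data.frame(
  SURVEY_ID  = c("Survey1", "Survey2", "Survey2", "Survey2",
                 "Survey3", "Survey4", "Survey4"),
  CHILD_NAME = c("Billy", "Claude", "Maude", "Constance",
                 "George", "Marjoram", "LeBron"),
  CHILD_AGE  = c(4L, 12L, 6L, 3L, 22L, 14L, 37L),
  stringsAsFactors = FALSE
)

# One row per child, so a survey with several children appears several times
table(df$SURVEY_ID)
```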

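I'm trying to pivot the data wider so that a) there is only one row per unique SURVEY_ID and, crucially, b) new columns are created for the second, third, etc. child in surveys that have more than one child.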

So the result would look like:

    SURVEY_ID    CHILD_NAME1    CHILD_NAME2    CHILD_NAME3    CHILD_AGE1  CHILD_AGE2  CHILD_AGE3
    Survey1      Billy                                        4
    Survey2      Claude         Maude          Constance      12          6           3
    Survey3      George                                       22
    Survey4      Marjoram       LeBron                        14          37

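The real data has thousands of surveys, and the number of child names and child ages per survey can be as high as 10. What has me stumped is creating new columns that do not come from existing value names, and only where a survey has more than one child.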

Using base R:

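# time = running child number within each survey (also works if SURVEY_ID is a factor)
reshape(transform(df, time = ave(seq_along(SURVEY_ID), SURVEY_ID, FUN = seq_along)),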
       v.names = c('CHILD_NAME', 'CHILD_AGE'), 
       direction = 'wide', idvar = 'SURVEY_ID', sep = '_')

  SURVEY_ID CHILD_NAME_1 CHILD_AGE_1 CHILD_NAME_2 CHILD_AGE_2 CHILD_NAME_3 CHILD_AGE_3
1   Survey1        Billy           4         <NA>          NA         <NA>          NA
2   Survey2       Claude          12        Maude           6    Constance           3
5   Survey3       George          22         <NA>          NA         <NA>          NA
6   Survey4     Marjoram          14       LeBron          37         <NA>          NA
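
The step doing the real work above is the within-survey counter stored in time; reshape() then uses it to suffix the new columns. A minimal sketch of that counter on its own, assuming the sample data from the question (rebuilt here as df so the snippet runs by itself):

```r
# Sample data from the question; df is an assumed object name
df <- data.frame(
  SURVEY_ID  = c("Survey1", "Survey2", "Survey2", "Survey2",
                 "Survey3", "Survey4", "Survey4"),
  CHILD_NAME = c("Billy", "Claude", "Maude", "Constance",
                 "George", "Marjoram", "LeBron"),
  CHILD_AGE  = c(4L, 12L, 6L, 3L, 22L, 14L, 37L),
  stringsAsFactors = FALSE
)

# Number the children within each survey, restarting at 1 for every SURVEY_ID
child_index <- ave(seq_along(df$SURVEY_ID), df$SURVEY_ID, FUN = seq_along)
child_index
# [1] 1 1 2 3 1 1 2

# The largest index is the number of NAME/AGE column pairs the wide table needs
max(child_index)
# [1] 3
```

With the real data, max(child_index) should come out at 10 or less if the description in the question holds.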

Using tidyverse:

library(tidyverse)
df %>%
  group_by(SURVEY_ID) %>%
  mutate(name = row_number()) %>%
  pivot_wider(id_cols = SURVEY_ID, names_from = name, values_from = c(CHILD_NAME, CHILD_AGE))

# A tibble: 4 x 7
# Groups:   SURVEY_ID [4]
  SURVEY_ID CHILD_NAME_1 CHILD_NAME_2 CHILD_NAME_3 CHILD_AGE_1 CHILD_AGE_2 CHILD_AGE_3
  <chr>     <chr>        <chr>        <chr>              <int>       <int>       <int>
1 Survey1   Billy        NA           NA                     4          NA          NA
2 Survey2   Claude       Maude        Constance             12           6           3
3 Survey3   George       NA           NA                    22          NA          NA
4 Survey4   Marjoram     LeBron       NA                    14          37          NA
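
If the column names should match the layout in the question exactly (CHILD_NAME1 rather than CHILD_NAME_1), pivot_wider() has a names_sep argument that controls the separator. A minimal sketch, assuming the sample data from the question; the helper column child is a name chosen here for illustration:

```r
library(dplyr)
library(tidyr)

# Sample data from the question; df is an assumed object name
df <- data.frame(
  SURVEY_ID  = c("Survey1", "Survey2", "Survey2", "Survey2",
                 "Survey3", "Survey4", "Survey4"),
  CHILD_NAME = c("Billy", "Claude", "Maude", "Constance",
                 "George", "Marjoram", "LeBron"),
  CHILD_AGE  = c(4L, 12L, 6L, 3L, 22L, 14L, 37L),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(SURVEY_ID) %>%
  mutate(child = row_number()) %>%   # 1, 2, 3, ... within each survey
  ungroup() %>%                      # drop the grouping before reshaping
  pivot_wider(
    id_cols     = SURVEY_ID,
    names_from  = child,
    values_from = c(CHILD_NAME, CHILD_AGE),
    names_sep   = ""                 # CHILD_NAME1 instead of CHILD_NAME_1
  )

names(wide)
```

Surveys with fewer children than the maximum get NA in the unused columns, as in the output above.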

Using data.table:

library(data.table)
dcast(setDT(df), SURVEY_ID~rowid(SURVEY_ID), value.var = c('CHILD_AGE', 'CHILD_NAME'))
   SURVEY_ID CHILD_AGE_1 CHILD_AGE_2 CHILD_AGE_3 CHILD_NAME_1 CHILD_NAME_2 CHILD_NAME_3
1:   Survey1           4          NA          NA        Billy         <NA>         <NA>
2:   Survey2          12           6           3       Claude        Maude    Constance
3:   Survey3          22          NA          NA       George         <NA>         <NA>
4:   Survey4          14          37          NA     Marjoram       LeBron         <NA>