Pivot_wider: Combine Duplicate Observations AND Create New Variable Columns for Those Values
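I'm new to R and have searched the site for a solution. I found many similar but slightly different questions, and I'm stumped.
I have a dataset with this structure: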
SURVEY_ID CHILD_NAME CHILD_AGE
Survey1 Billy 4
Survey2 Claude 12
Survey2 Maude 6
Survey2 Constance 3
Survey3 George 22
Survey4 Marjoram 14
Survey4 LeBron 37
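For anyone who wants to run the answers below, here is a minimal sketch that rebuilds the sample data as a data frame. The name `df` matches what the answers use; the column types (character IDs and names, integer ages) are an assumption inferred from the printed output, not something the question states.

```r
# Rebuild the example data shown in the table above.
# Column types are assumed: character for IDs and names, integer for ages.
df <- data.frame(
  SURVEY_ID  = c("Survey1", "Survey2", "Survey2", "Survey2",
                 "Survey3", "Survey4", "Survey4"),
  CHILD_NAME = c("Billy", "Claude", "Maude", "Constance",
                 "George", "Marjoram", "LeBron"),
  CHILD_AGE  = c(4L, 12L, 6L, 3L, 22L, 14L, 37L),
  stringsAsFactors = FALSE  # keep SURVEY_ID and CHILD_NAME as character
)

str(df)
```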
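I'm trying to pivot the data wider so that a) each row has exactly one unique SURVEY_ID and, crucially, b) surveys with more than one child get new columns for the second, third, etc. child.
So the result would look like: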
SURVEY_ID  CHILD_NAME1  CHILD_NAME2  CHILD_NAME3  CHILD_AGE1  CHILD_AGE2  CHILD_AGE3
Survey1    Billy                                  4
Survey2    Claude       Maude        Constance    12          6           3
Survey3    George                                 22
Survey4    Marjoram     LeBron                    14          37
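The real data has thousands of surveys, and the number of child names and child ages per survey can be as high as 10. What confuses me is creating new columns that do not come from existing values, and only for the surveys that have more than one child.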
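Using base R:
# Number the children within each survey, then reshape to one row per survey.
# seq_along() is used so the counter also works if SURVEY_ID is a factor.
reshape(transform(df, time = ave(seq_along(SURVEY_ID), SURVEY_ID, FUN = seq_along)),
        v.names = c('CHILD_NAME', 'CHILD_AGE'),
        direction = 'wide', idvar = 'SURVEY_ID', sep = '_')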
  SURVEY_ID CHILD_NAME_1 CHILD_AGE_1 CHILD_NAME_2 CHILD_AGE_2 CHILD_NAME_3 CHILD_AGE_3
1   Survey1        Billy           4         <NA>          NA         <NA>          NA
2   Survey2       Claude          12        Maude           6    Constance           3
5   Survey3       George          22         <NA>          NA         <NA>          NA
6   Survey4     Marjoram          14       LeBron          37         <NA>          NA
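Every approach in this thread depends on the same step: giving each child a running number within its survey, which then becomes the suffix of the new column names. A minimal sketch of that counter on its own, using a hypothetical `survey_id` vector standing in for `df$SURVEY_ID`:

```r
# Hypothetical vector standing in for df$SURVEY_ID
survey_id <- c("Survey1", "Survey2", "Survey2", "Survey2",
               "Survey3", "Survey4", "Survey4")

# ave() applies seq_along() separately within each group of identical IDs,
# so the children of every survey are numbered 1, 2, 3, ...
child_index <- ave(seq_along(survey_id), survey_id, FUN = seq_along)

print(child_index)
```

This prints `1 1 2 3 1 1 2`: the counter restarts at 1 whenever a new survey begins, so the largest value equals the number of column sets the wide result needs.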
Using tidyverse:
library(tidyverse)
df %>%
  group_by(SURVEY_ID) %>%
  mutate(name = row_number()) %>%   # number the children within each survey
  pivot_wider(id_cols = SURVEY_ID, names_from = name,
              values_from = c(CHILD_NAME, CHILD_AGE))
# A tibble: 4 x 7
# Groups:   SURVEY_ID [4]
  SURVEY_ID CHILD_NAME_1 CHILD_NAME_2 CHILD_NAME_3 CHILD_AGE_1 CHILD_AGE_2 CHILD_AGE_3
  <chr>     <chr>        <chr>        <chr>              <int>       <int>       <int>
1 Survey1   Billy        NA           NA                     4          NA          NA
2 Survey2   Claude       Maude        Constance             12           6           3
3 Survey3   George       NA           NA                    22          NA          NA
4 Survey4   Marjoram     LeBron       NA                    14          37          NA
Using data.table:
library(data.table)
dcast(setDT(df), SURVEY_ID~rowid(SURVEY_ID), value.var = c('CHILD_AGE', 'CHILD_NAME'))
   SURVEY_ID CHILD_AGE_1 CHILD_AGE_2 CHILD_AGE_3 CHILD_NAME_1 CHILD_NAME_2 CHILD_NAME_3
1:   Survey1           4          NA          NA        Billy         <NA>         <NA>
2:   Survey2          12           6           3       Claude        Maude    Constance
3:   Survey3          22          NA          NA       George         <NA>         <NA>
4:   Survey4          14          37          NA     Marjoram       LeBron         <NA>