需要 'reshape' 数据框

Question

数据集：

      zip acs.pop napps    pperct cgrp zgrp      perc  
 1: 12007      97     2 2.0618557    2    1 25.000000  
 2: 12007      97     2 2.0618557   NA    2 50.000000  
 3: 12007      97     2 2.0618557    1    1 25.000000  
 4: 12008     485     2 0.4123711    2    1 33.333333  
 5: 12008     485     2 0.4123711    4    1 33.333333  
 6: 12008     485     2 0.4123711   NA    1 33.333333  
 7: 12009    7327   187 2.5522042    4   76 26.206897  
 8: 12009    7327   187 2.5522042    1   41 14.137931  
 9: 12009    7327   187 2.5522042    2   23  7.931034  
10: 12009    7327   187 2.5522042   NA  103 35.517241  
11: 12009    7327   187 2.5522042    3   47 16.206897  
12: 12010   28802   580 2.0137490   NA  275 32.163743  
13: 12010   28802   580 2.0137490    4  122 14.269006  
14: 12010   28802   580 2.0137490    1  269 31.461988  
15: 12010   28802   580 2.0137490    2   96 11.228070  
16: 12010   28802   580 2.0137490    3   93 10.877193  
17: 12018    7608   126 1.6561514    3   30 16.129032  
18: 12018    7608   126 1.6561514   NA   60 32.258065  
19: 12018    7608   126 1.6561514    2   14  7.526882  
20: 12018    7608   126 1.6561514    4   57 30.645161  
21: 12018    7608   126 1.6561514    1   25 13.440860  
22: 12019   14841   144 0.9702850   NA   62 30.097087  
23: 12019   14841   144 0.9702850    4   73 35.436893  
24: 12019   14841   144 0.9702850    3   30 14.563107  
25: 12019   14841   144 0.9702850    1   23 11.165049  
26: 12019   14841   144 0.9702850    2   18  8.737864  
27: 12020   31403   343 1.0922523    3   76 14.960630  
28: 12020   31403   343 1.0922523    1   88 17.322835  
29: 12020   31403   343 1.0922523    2   38  7.480315  
30: 12020   31403   343 1.0922523    4  141 27.755906  
31: 12020   31403   343 1.0922523   NA  165 32.480315  
32: 12022    1002     5 0.4990020   NA    4 44.444444  
33: 12022    1002     5 0.4990020    4    2 22.222222  
34: 12022    1002     5 0.4990020    3    1 11.111111  
35: 12022    1002     5 0.4990020    1    1 11.111111

我知道 reshape2 或 reshape 包可以处理这个问题，但我不确定如何处理。我需要最终输出如下所示：

zip  acs.pop    napps   pperct  zgrp4   zgrp3   zgrp2   zgrp1   perc4   perc3   perc2   perc1  
12009   7327    187 2.5522042   76  47  23  41  26.206897   16.206897   7.931034    14.137931

zip 是 id
acs.pop、napps、pperct 对于每个 zip 组都是相同的
zgrp4…zgrp1 是每个 cgrp 值的 zgrp 值
perc4…perc1 是每个 cgrp 值的 perc 值

Answer 1

这是很好用于tidyr中的spread()。

df %>% filter(!is.na(cgrp)) %>% # if cgrp is missing I don't know where to put the obs
  gather(Var, Val,6:7) %>% # one row per measure (zgrp OR perc) observed
  group_by(zip, acs.pop, napps, pperct) %>% # unique combos of these will define rows in output
  unite(Var1,Var,cgrp) %>% # indentify which obs for which measure
  spread(Var1, Val) # make columns for zgrp_1, _2, etc., perc1,2, etc

示例输出：

> df2[df2$zip==12009,]
Source: local data frame [1 x 12]

    zip acs.pop napps   pperct   perc_1   perc_2  perc_3  perc_4 zgrp_1 zgrp_2 zgrp_3 zgrp_4
1 12009    7327   187 2.552204 14.13793 7.931034 16.2069 26.2069     41     23     47     76

感谢@akrun 的协助

Answer 2

我们可以尝试 data.table 的开发版本中的 dcast，它可以包含多个 value.var 列。在这种情况下，我们有 'zgrp' 和 'perc' 是值列。使用分组变量，我们创建一个序列变量 ('ind')，然后使用 dcast 将 'long' 转换为 'wide' 格式。

安装开发版的说明是here

 library(data.table)#v1.9.5
 setDT(df1)[, ind:= 1:.N, .(zip, acs.pop, napps, pperct)]
 dcast(df1, zip+acs.pop + napps+pperct~ind, value.var=c('zgrp', 'perc'))
 #     zip acs.pop napps    pperct 1_zgrp 2_zgrp 3_zgrp 4_zgrp 5_zgrp   1_perc
#1: 12007      97     2 2.0618557      1      2      1     NA     NA 25.00000
#2: 12008     485     2 0.4123711      1      1      1     NA     NA 33.33333
#3: 12009    7327   187 2.5522042     76     41     23    103     47 26.20690
#4: 12010   28802   580 2.0137490    275    122    269     96     93 32.16374
#5: 12018    7608   126 1.6561514     30     60     14     57     25 16.12903
#6: 12019   14841   144 0.9702850     62     73     30     23     18 30.09709
#7: 12020   31403   343 1.0922523     76     88     38    141    165 14.96063
#8: 12022    1002     5 0.4990020      4      2      1      1     NA 44.44444
#     2_perc    3_perc   4_perc    5_perc
#1: 50.00000 25.000000       NA        NA
#2: 33.33333 33.333333       NA        NA
#3: 14.13793  7.931034 35.51724 16.206897
#4: 14.26901 31.461988 11.22807 10.877193
#5: 32.25807  7.526882 30.64516 13.440860
#6: 35.43689 14.563107 11.16505  8.737864
#7: 17.32284  7.480315 27.75591 32.480315
#8: 22.22222 11.111111 11.11111        NA

或者我们可以使用 ave/reshape 从 base R

 df2 <- transform(df1, ind=ave(seq_along(zip), zip,
                 acs.pop, napps, pperct, FUN=seq_along))
 reshape(df2, idvar=c('zip', 'acs.pop', 'napps', 'pperct'),
                               timevar='ind', direction='wide')

需要 'reshape' 数据框

need to 'reshape' dataframe

r

reshape