Need to 'reshape' a data frame
Dataset:
zip acs.pop napps pperct cgrp zgrp perc
1: 12007 97 2 2.0618557 2 1 25.000000
2: 12007 97 2 2.0618557 NA 2 50.000000
3: 12007 97 2 2.0618557 1 1 25.000000
4: 12008 485 2 0.4123711 2 1 33.333333
5: 12008 485 2 0.4123711 4 1 33.333333
6: 12008 485 2 0.4123711 NA 1 33.333333
7: 12009 7327 187 2.5522042 4 76 26.206897
8: 12009 7327 187 2.5522042 1 41 14.137931
9: 12009 7327 187 2.5522042 2 23 7.931034
10: 12009 7327 187 2.5522042 NA 103 35.517241
11: 12009 7327 187 2.5522042 3 47 16.206897
12: 12010 28802 580 2.0137490 NA 275 32.163743
13: 12010 28802 580 2.0137490 4 122 14.269006
14: 12010 28802 580 2.0137490 1 269 31.461988
15: 12010 28802 580 2.0137490 2 96 11.228070
16: 12010 28802 580 2.0137490 3 93 10.877193
17: 12018 7608 126 1.6561514 3 30 16.129032
18: 12018 7608 126 1.6561514 NA 60 32.258065
19: 12018 7608 126 1.6561514 2 14 7.526882
20: 12018 7608 126 1.6561514 4 57 30.645161
21: 12018 7608 126 1.6561514 1 25 13.440860
22: 12019 14841 144 0.9702850 NA 62 30.097087
23: 12019 14841 144 0.9702850 4 73 35.436893
24: 12019 14841 144 0.9702850 3 30 14.563107
25: 12019 14841 144 0.9702850 1 23 11.165049
26: 12019 14841 144 0.9702850 2 18 8.737864
27: 12020 31403 343 1.0922523 3 76 14.960630
28: 12020 31403 343 1.0922523 1 88 17.322835
29: 12020 31403 343 1.0922523 2 38 7.480315
30: 12020 31403 343 1.0922523 4 141 27.755906
31: 12020 31403 343 1.0922523 NA 165 32.480315
32: 12022 1002 5 0.4990020 NA 4 44.444444
33: 12022 1002 5 0.4990020 4 2 22.222222
34: 12022 1002 5 0.4990020 3 1 11.111111
35: 12022 1002 5 0.4990020 1 1 11.111111
I know the reshape2 or reshape packages can handle this, but I'm not sure how. I need the final output to look like this:
zip acs.pop napps pperct zgrp4 zgrp3 zgrp2 zgrp1 perc4 perc3 perc2 perc1
12009 7327 187 2.5522042 76 47 23 41 26.206897 16.206897 7.931034 14.137931
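zip is the id
acs.pop, napps, and pperct are the same within each zip group
zgrp4…zgrp1 are the zgrp values for each value of cgrp
perc4…perc1 are the perc values for each value of cgrp
This is a good use case for spread() in tidyr.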
library(dplyr)
library(tidyr)
df2 <- df %>% filter(!is.na(cgrp)) %>% # if cgrp is missing I don't know where to put the obs
  gather(Var, Val, 6:7) %>% # one row per measure (zgrp OR perc) observed
  group_by(zip, acs.pop, napps, pperct) %>% # unique combos of these will define rows in output
  unite(Var1, Var, cgrp) %>% # identify which obs belongs to which measure
  spread(Var1, Val) # make columns for zgrp_1, _2, etc., perc_1, _2, etc.
Sample output:
> df2[df2$zip==12009,]
Source: local data frame [1 x 12]
zip acs.pop napps pperct perc_1 perc_2 perc_3 perc_4 zgrp_1 zgrp_2 zgrp_3 zgrp_4
1 12009 7327 187 2.552204 14.13793 7.931034 16.2069 26.2069 41 23 47 76
Thanks to @akrun for the assist.
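In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a minimal sketch, assuming tidyr 1.0.0 or later and using only the zip 12009 rows from the question (the names df_12009 and wide are illustrative, not from the original answer), the same reshape might be written in one step:

```r
library(dplyr)
library(tidyr)

# Assumption: only the zip 12009 rows from the question, typed in by hand
df_12009 <- data.frame(
  zip     = 12009,
  acs.pop = 7327,
  napps   = 187,
  pperct  = 2.5522042,
  cgrp    = c(4, 1, 2, NA, 3),
  zgrp    = c(76, 41, 23, 103, 47),
  perc    = c(26.206897, 14.137931, 7.931034, 35.517241, 16.206897)
)

wide <- df_12009 %>%
  filter(!is.na(cgrp)) %>%            # rows without a cgrp have no target column
  pivot_wider(
    id_cols     = c(zip, acs.pop, napps, pperct),
    names_from  = cgrp,               # one set of columns per cgrp value
    values_from = c(zgrp, perc),      # both measures are spread at once
    names_sep   = ""                  # gives zgrp4, perc4, ... as in the question
  )

print(wide)
```

Setting names_sep = "" yields the zgrp4/perc4 style the question asks for; the default separator would give zgrp_4/perc_4 instead.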
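We can try dcast from the development version of data.table, which accepts multiple value.var columns. Here, 'zgrp' and 'perc' are the value columns. Using the grouping variables, we create a sequence variable ('ind') and then use dcast to convert from 'long' to 'wide' format.
Instructions for installing the development version are here.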
library(data.table)#v1.9.5
setDT(df1)[, ind:= 1:.N, .(zip, acs.pop, napps, pperct)]
dcast(df1, zip+acs.pop + napps+pperct~ind, value.var=c('zgrp', 'perc'))
# zip acs.pop napps pperct 1_zgrp 2_zgrp 3_zgrp 4_zgrp 5_zgrp 1_perc
#1: 12007 97 2 2.0618557 1 2 1 NA NA 25.00000
#2: 12008 485 2 0.4123711 1 1 1 NA NA 33.33333
#3: 12009 7327 187 2.5522042 76 41 23 103 47 26.20690
#4: 12010 28802 580 2.0137490 275 122 269 96 93 32.16374
#5: 12018 7608 126 1.6561514 30 60 14 57 25 16.12903
#6: 12019 14841 144 0.9702850 62 73 30 23 18 30.09709
#7: 12020 31403 343 1.0922523 76 88 38 141 165 14.96063
#8: 12022 1002 5 0.4990020 4 2 1 1 NA 44.44444
# 2_perc 3_perc 4_perc 5_perc
#1: 50.00000 25.000000 NA NA
#2: 33.33333 33.333333 NA NA
#3: 14.13793 7.931034 35.51724 16.206897
#4: 14.26901 31.461988 11.22807 10.877193
#5: 32.25807 7.526882 30.64516 13.440860
#6: 35.43689 14.563107 11.16505 8.737864
#7: 17.32284 7.480315 27.75591 32.480315
#8: 22.22222 11.111111 11.11111 NA
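Note that the call above spreads on the running index 'ind', so a column such as 1_zgrp holds the first row of each zip rather than the row where cgrp is 1. If the columns should follow cgrp, as in the question's desired output, a sketch along these lines may be closer. It assumes data.table 1.9.6 or later, where multiple value.var support is available from CRAN; the names dt and wide_dt are illustrative, and the data are just the zip 12009 rows:

```r
library(data.table)

# Assumption: only the zip 12009 rows from the question, typed in by hand
dt <- data.table(
  zip     = 12009,
  acs.pop = 7327,
  napps   = 187,
  pperct  = 2.5522042,
  cgrp    = c(4, 1, 2, NA, 3),
  zgrp    = c(76, 41, 23, 103, 47),
  perc    = c(26.206897, 14.137931, 7.931034, 35.517241, 16.206897)
)

# Drop rows with a missing cgrp, then cast on cgrp instead of a row index
wide_dt <- dcast(
  dt[!is.na(cgrp)],
  zip + acs.pop + napps + pperct ~ cgrp,
  value.var = c("zgrp", "perc")
)

print(wide_dt)
```

With this formula the resulting columns are named after the cgrp values (zgrp_1 ... zgrp_4, perc_1 ... perc_4), so no separate 'ind' column is needed.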
Or we can use ave/reshape from base R
df2 <- transform(df1, ind=ave(seq_along(zip), zip,
acs.pop, napps, pperct, FUN=seq_along))
reshape(df2, idvar=c('zip', 'acs.pop', 'napps', 'pperct'),
timevar='ind', direction='wide')
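As with the dcast call, this reshapes on the row index, and it also carries cgrp into the wide columns (cgrp.1, cgrp.2, and so on). A base R sketch that uses cgrp itself as the time variable might look like the following; the names df_12009, keep and wide_base are illustrative, and the data are only the zip 12009 rows from the question:

```r
# Assumption: only the zip 12009 rows from the question, typed in by hand
df_12009 <- data.frame(
  zip     = 12009,
  acs.pop = 7327,
  napps   = 187,
  pperct  = 2.5522042,
  cgrp    = c(4, 1, 2, NA, 3),
  zgrp    = c(76, 41, 23, 103, 47),
  perc    = c(26.206897, 14.137931, 7.931034, 35.517241, 16.206897)
)

# Rows with a missing cgrp cannot be assigned to a column, so drop them first
keep <- df_12009[!is.na(df_12009$cgrp), ]

# cgrp is the time variable; zgrp and perc become zgrp.<cgrp> and perc.<cgrp>
wide_base <- reshape(
  keep,
  idvar     = c("zip", "acs.pop", "napps", "pperct"),
  timevar   = "cgrp",
  direction = "wide"
)

print(wide_base)
```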