Merging data frames of different lengths, averaging values where observations overlap
For example, I have 3 data frames:
test.df1
date x y z
1 1998-01-01 10 10 10
2 1998-02-01 10 10 10
3 1998-03-01 10 10 10
4 1998-04-01 10 10 10
5 1998-05-01 10 10 10
6 1998-06-01 10 10 10
test.df2
date x y z
1 1998-03-01 5 5 5
2 1998-04-01 5 5 5
3 1998-05-01 5 5 5
4 1998-06-01 5 5 5
test.df3
date x y z
1 1998-05-01 1 1 1
2 1998-06-01 1 1 1
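I would like to merge them so that the new data frame has the same number of rows as the largest data frame (test.df1 in this example) and, where dates overlap, contains the mean of each variable. In the example above, the new data frame should have 4 columns and 6 rows. x, y and z should stay 10 for 1998-01-01 and 1998-02-01; they should be 7.5 (the mean of 10 and 5) for 1998-03-01 and 1998-04-01; and they should be 5.33 (the mean of 10, 5 and 1) for 1998-05-01 and 1998-06-01.
Is there a way to do this in R?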
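As a quick sanity check of the arithmetic in the question, the expected value for each kind of date can be computed directly. This is only a sketch; the list name vals_by_date is made up for illustration, and each entry holds the values contributed by the data frames that cover that date.

```r
# Values available on a representative date from each overlap group
# (hypothetical list, built by hand from the example tables)
vals_by_date <- list(
  "1998-01-01" = c(10),        # only test.df1 covers this date
  "1998-03-01" = c(10, 5),     # test.df1 and test.df2
  "1998-05-01" = c(10, 5, 1)   # all three data frames
)

# Mean of the available values on each date
expected <- sapply(vals_by_date, mean)
print(expected)
```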
dput(test.df1)
structure(list(date = structure(c(10227, 10258, 10286, 10317,
10347, 10378), class = "Date"), x = c(10, 10, 10, 10, 10, 10),
y = c(10, 10, 10, 10, 10, 10), z = c(10, 10, 10, 10, 10,
10)), .Names = c("date", "x", "y", "z"), row.names = c(NA,
-6L), class = "data.frame")
dput(test.df2)
structure(list(date = structure(c(10286, 10317, 10347, 10378), class = "Date"),
x = c(5, 5, 5, 5), y = c(5, 5, 5, 5), z = c(5, 5, 5, 5)), .Names = c("date",
"x", "y", "z"), row.names = c(NA, -4L), class = "data.frame")
dput(test.df3)
structure(list(date = structure(c(10347, 10378), class = "Date"),
x = c(1, 1), y = c(1, 1), z = c(1, 1)), .Names = c("date",
"x", "y", "z"), row.names = c(NA, -2L), class = "data.frame")
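My approach is to first row-bind the data frames, keeping the duplicated dates, and then use the plyr package to average by date (when calling colMeans, make sure to exclude the non-numeric columns):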
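library(plyr)
# Stack all rows; overlapping dates now appear more than once
test.merge <- rbind(test.df1, test.df2, test.df3)
# For each date, average every column except the first one (date)
test.merge <- ddply(test.merge, ~date, function(x){
  colMeans(x[,-1])
})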
Output:
date x y z
1 1998-01-01 10.000000 10.000000 10.000000
2 1998-02-01 10.000000 10.000000 10.000000
3 1998-03-01 7.500000 7.500000 7.500000
4 1998-04-01 7.500000 7.500000 7.500000
5 1998-05-01 5.333333 5.333333 5.333333
6 1998-06-01 5.333333 5.333333 5.333333
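The advice about excluding non-numeric columns can be made independent of column position. The following is a minimal base R sketch, assuming a hypothetical helper called numeric_col_means (it is not part of the answer above) that selects columns by type instead of dropping the first one:

```r
# Hypothetical helper: column means over numeric columns only.
# is.numeric() is FALSE for Date and factor columns, so they are skipped.
numeric_col_means <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  colMeans(df[, is_num, drop = FALSE])
}

# Small illustration: three observations that share one date
one_date <- data.frame(
  date = as.Date(c("1998-05-01", "1998-05-01", "1998-05-01")),
  x = c(10, 5, 1),
  y = c(10, 5, 1)
)
means <- numeric_col_means(one_date)
print(means)
```

Such a helper could be passed to ddply in place of the anonymous function, provided the grouping column is not numeric.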
We can use dplyr and tidyr:
library(dplyr)
library(tidyr)
test.df1 %>% left_join(test.df2, by = "date") %>%
  left_join(test.df3, by = "date") %>%
  gather(var, val, -date) %>%
  mutate(var = substr(var, 1, 1)) %>%
  group_by(date, var) %>%
  summarise(val = mean(val, na.rm = TRUE)) %>%
  spread(var, val)
Source: local data frame [6 x 4]
date x y z
(date) (dbl) (dbl) (dbl)
1 1998-01-01 10.000000 10.000000 10.000000
2 1998-02-01 10.000000 10.000000 10.000000
3 1998-03-01 7.500000 7.500000 7.500000
4 1998-04-01 7.500000 7.500000 7.500000
5 1998-05-01 5.333333 5.333333 5.333333
6 1998-06-01 5.333333 5.333333 5.333333
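In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same join-then-reshape idea might look like the sketch below, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0; the data frames are rebuilt from the question so the block runs on its own, and the name res is only for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the three example data frames from the question
test.df1 <- data.frame(date = seq(as.Date("1998-01-01"), by = "month", length.out = 6),
                       x = 10, y = 10, z = 10)
test.df2 <- data.frame(date = seq(as.Date("1998-03-01"), by = "month", length.out = 4),
                       x = 5, y = 5, z = 5)
test.df3 <- data.frame(date = seq(as.Date("1998-05-01"), by = "month", length.out = 2),
                       x = 1, y = 1, z = 1)

res <- test.df1 %>%
  left_join(test.df2, by = "date") %>%   # clashing columns become x.x, x.y, ...
  left_join(test.df3, by = "date") %>%   # the third set keeps plain x, y, z
  pivot_longer(-date, names_to = "var", values_to = "val") %>%
  mutate(var = substr(var, 1, 1)) %>%    # strip the .x / .y suffixes
  group_by(date, var) %>%
  summarise(val = mean(val, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = var, values_from = val)

print(res)
```

Two caveats: left_join() keeps only the dates present in test.df1, which matches the requested row count, and substr(var, 1, 1) assumes single-character variable names as in this example.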
A one-liner in base R should get you there:
aggregate(. ~ date, data=rbind(test.df1,test.df2,test.df3), FUN=mean)
# date x y z
#1 1998-01-01 10.000000 10.000000 10.000000
#2 1998-02-01 10.000000 10.000000 10.000000
#3 1998-03-01 7.500000 7.500000 7.500000
#4 1998-04-01 7.500000 7.500000 7.500000
#5 1998-05-01 5.333333 5.333333 5.333333
#6 1998-06-01 5.333333 5.333333 5.333333
Use rbind to stack all rows into one big data.frame, then aggregate by date so that the mean is computed wherever dates overlap.
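The rbind-then-aggregate idea extends to any number of data frames held in a list. Below is a base R sketch under that assumption; the names dfs, stacked, largest and res are illustrative, and the final filter is an addition that keeps only the dates of the largest data frame, in case the smaller ones contain dates outside it.

```r
# Rebuild the three example data frames from the question
test.df1 <- data.frame(date = seq(as.Date("1998-01-01"), by = "month", length.out = 6),
                       x = 10, y = 10, z = 10)
test.df2 <- data.frame(date = seq(as.Date("1998-03-01"), by = "month", length.out = 4),
                       x = 5, y = 5, z = 5)
test.df3 <- data.frame(date = seq(as.Date("1998-05-01"), by = "month", length.out = 2),
                       x = 1, y = 1, z = 1)

dfs <- list(test.df1, test.df2, test.df3)

# Stack every data frame in the list, then average by date
stacked <- do.call(rbind, dfs)
res <- aggregate(. ~ date, data = stacked, FUN = mean)

# Keep only the dates that occur in the largest data frame
largest <- dfs[[which.max(vapply(dfs, nrow, integer(1)))]]
res <- res[res$date %in% largest$date, ]

print(res)
```

With the formula interface, aggregate() drops rows containing NA by default (na.action = na.omit), which matters if the real data has missing values.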
If you are a dplyr user, the same logic applies:
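library(dplyr)
# bind_rows() stacks the data frames (rbind_all() is no longer available in current dplyr)
bind_rows(test.df1, test.df2, test.df3) %>%
  group_by(date) %>%
  # average every non-grouping column; across() needs dplyr >= 1.0.0
  summarise(across(everything(), mean))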