R: Merging 2 dataframes and applying reference data to all rows that match by one level
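I have two data frames: one ("grny") is mostly reference data but also has some data in the "yield" column that I'm after, and the other ("txie") has "yield" data with some NAs for missing values. I want to merge them so that all cells are filled in for rows that share a value in "site".
The one with mostly year-by-year data: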
txie<-data.frame (site=c(rep("smithfield",2),rep("belleville",3)),
yield=c((rnorm(4, mean=8)),NA),
year=c(1999:2000,1992:1994),
prim=c(rep("nt",2),rep(NA,3)))
The one that is mostly reference, with some year-by-year yield data:
grny<-data.frame (site=c("smithfield","belleville",rep("nashua",3)),
yield=c(rep(NA,2),rnorm(3,mean=9)),
year=c(rep(NA,2),1990:1992),
prim=c(NA,"nt",sample(c("nt","ct"),3,rep=TRUE)),
lat=(c(rnorm(2,mean=45,sd=10),rep(49.1,3))))
What I want:
        site    yield year prim  lib      lat
1 smithfield 7.009178 1999   nt 1109 43.61828
2 smithfield 8.472677 2000   nt 1109 43.61828
3 belleville 8.857462 1992   nt  122 74.08792
4 belleville 7.368488 1993   nt  122 74.08792
5 belleville       NA 1994   nt  122 74.08792
6     nashua 7.494519 1990   nt  554 49.10000
8     nashua 8.696066 1991   ct  554 49.10000
9     nashua 8.051670 1992   nt  554 49.10000
What I've tried:
rbind.fill(txie,grny) #this appends rows to the correct columns but leaves NA's everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x,y) merge(txie,grny, by="site", all.y=TRUE), list(txie,grny)) #this merges by rows but creates new variables from x and y.
merge(x = txie, y = grny, by = "site", all = TRUE) #this does the same as the above (new variables from each x and y ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)#this does similar to above but merges based on the x df (new variables from each x and y ending in .x or .y)
setkey(setDT(grny),site)[txie]# this gives a similar result to the all.x line
For example, with the outer-join merge, I end up with:
        site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000
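Caveat: I want to keep the NAs that are already in the "yield" column (e.g. belleville in 1994).
Any answers? Or can someone point me to an example of this kind of merge, where the data already sits in one or more shared columns that you are not merging by, rather than each df bringing in new columns besides the "by" variable?
Thanks!!!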
With the dplyr package, you can do a full_join and then use the coalesce function to take the non-NA value from each pair of columns: yield.x and yield.y, prim.x and prim.y, and so on.
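As a quick illustration of what coalesce does on its own, here is a minimal sketch on two made-up vectors (the names `from_x` and `from_y` are placeholders, not columns from the question). For each position it returns the first value that is not NA, and it stays NA only where every input is NA, which is why the existing NA in "yield" survives.

```r
library(dplyr)

# Illustrative vectors only; they stand in for a .x / .y column pair.
from_x <- c(1, NA, NA)
from_y <- c(9, 2, NA)

# Position 1: from_x wins because it is not NA.
# Position 2: from_x is NA, so from_y fills it.
# Position 3: both are NA, so the result stays NA.
filled <- coalesce(from_x, from_y)
print(filled)
```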
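library(dplyr)

full_join(txie, grny, by = "site") %>%
  # take the first non-NA value from each .x/.y pair
  mutate(year  = coalesce(year.x, year.y),
         yield = coalesce(yield.x, yield.y),
         prim  = coalesce(prim.x, prim.y)) %>%
  # drop the suffixed columns now that they have been combined
  select(-c(year.x, year.y, yield.x, yield.y, prim.x, prim.y))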
        site      lat year     yield prim
1 smithfield 59.71994 1999  7.920844   nt
2 smithfield 59.71994 2000 10.122713   nt
3 belleville 34.93358 1992  8.622351   nt
4 belleville 34.93358 1993  7.360470   nt
5 belleville 34.93358 1994        NA   nt
6     nashua 49.10000 1990  9.083390   ct
7     nashua 49.10000 1991  8.073866   nt
8     nashua 49.10000 1992  8.725625   nt
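If there are many shared columns, typing out every pair gets tedious. Below is a rough sketch of one way to generalize the same idea. `coalesce_join` is a hypothetical helper written for this example, not a dplyr function, and the data frames use fixed made-up numbers instead of rnorm so the result is reproducible. It assumes each shared column has the same type in both data frames (character rather than factor for "prim").

```r
library(dplyr)

# Fixed stand-ins for the question's data (values are illustrative).
txie <- data.frame(
  site  = c("smithfield", "smithfield", "belleville", "belleville", "belleville"),
  yield = c(7.0, 8.5, 8.9, 7.4, NA),
  year  = c(1999L, 2000L, 1992L, 1993L, 1994L),
  prim  = c("nt", "nt", NA, NA, NA),
  stringsAsFactors = FALSE
)
grny <- data.frame(
  site  = c("smithfield", "belleville", "nashua", "nashua", "nashua"),
  yield = c(NA, NA, 9.1, 8.1, 8.7),
  year  = c(NA, NA, 1990L, 1991L, 1992L),
  prim  = c(NA, "nt", "ct", "nt", "nt"),
  lat   = c(43.6, 74.1, 49.1, 49.1, 49.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: full join on `by`, then collapse every column that
# exists in both inputs into one column, preferring the value from `x`.
coalesce_join <- function(x, y, by) {
  shared <- setdiff(intersect(names(x), names(y)), by)
  joined <- full_join(x, y, by = by, suffix = c(".x", ".y"))
  for (col in shared) {
    col_x <- paste0(col, ".x")
    col_y <- paste0(col, ".y")
    joined[[col]] <- coalesce(joined[[col_x]], joined[[col_y]])
    joined[[col_x]] <- NULL  # drop the suffixed pair once combined
    joined[[col_y]] <- NULL
  }
  joined
}

result <- coalesce_join(txie, grny, by = "site")
print(result)
```

Values from the first data frame win whenever both sides have one, so swap the arguments if the reference table should take priority instead.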