R: Merging 2 dataframes and applying reference data to all rows that match by one level

Question

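I have two data frames: one ("grny") is mostly reference data but also has some values in the "yield" column I am after, and the other ("txie") holds "yield" data with some NAs where data is missing. I want to merge them so that every cell is filled in for rows that share a value in "site".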

The one with mostly year-by-year data is:

txie <- data.frame(site  = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year  = c(1999:2000, 1992:1994),
                   prim  = c(rep("nt", 2), rep(NA, 3)))

The one that is mostly reference, with some year-by-year yield data:

grny <- data.frame(site  = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year  = c(rep(NA, 2), 1990:1992),
                   prim  = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat   = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)))

What I want:

         site    yield year prim  lib      lat
1  smithfield 7.009178 1999   nt 1109     43.61828
2  smithfield 8.472677 2000   nt 1109     43.61828
3  belleville 8.857462 1992   nt 122      74.08792
4  belleville 7.368488 1993   nt 122      74.08792
5  belleville       NA 1994   nt 122      74.08792
6  nashua     7.494519 1990   nt 554      49.10000
8  nashua     8.696066 1991   ct 554      49.10000
9  nashua     8.051670 1992   nt 554      49.10000

What I have tried:

library(plyr)        # provides rbind.fill
library(data.table)  # provides setDT and setkey
rbind.fill(txie, grny) # appends rows under the correct columns but leaves NAs everywhere, because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x, y) merge(txie, grny, by = "site", all.y = TRUE), list(txie, grny)) # merges by rows but creates new variables from x and y
merge(x = txie, y = grny, by = "site", all = TRUE) # same as the above (new variables from each of x and y, ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE) # similar to the above, but merges based on the x df (new variables ending in .x or .y)
setkey(setDT(grny), site)[txie] # gives a similar result to the all.x line

For example, for the outer-join merge I end up with:

     site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000

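Stipulations: I want to keep NAs that are already in the "yield" column (for example, belleville in 1994). Any answers? Or can someone point me to an example of this type of merge, where the data is already in one or more shared columns, rather than each df bringing in new columns apart from the "by" variable?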

Thanks!!!

Answer 1

With the dplyr package, you can do a full_join and then use the coalesce function to take the non-NA value from each pair of columns: yield.x vs. yield.y, prim.x vs. prim.y, and so on.
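To see what coalesce does in isolation, here is a minimal sketch on two plain vectors. The vectors `x` and `y` are made up for illustration and are not part of the question's data; they stand in for a pair such as yield.x and yield.y.

```r
library(dplyr)

# Hypothetical stand-ins for a pair of suffixed columns after a join.
x <- c(1, NA, 3, NA)
y <- c(10, 20, NA, NA)

# For each position, take the first value that is not NA.
# A position stays NA only when it is NA in both inputs.
filled <- coalesce(x, y)
filled  # 1, 20, 3, NA
```

This is why an NA that is genuinely missing in both data frames survives the merge, while an NA that has a counterpart in the other data frame gets filled.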

library(dplyr)

# Outer join on site, then collapse each .x/.y pair into a single column
full_join(txie, grny, by = "site") %>%
  mutate(year  = coalesce(year.x, year.y),
         yield = coalesce(yield.x, yield.y),
         prim  = coalesce(prim.x, prim.y)) %>%
  select(-c(year.x, year.y, yield.x, yield.y, prim.x, prim.y))

        site      lat year     yield prim
1 smithfield 59.71994 1999  7.920844   nt
2 smithfield 59.71994 2000 10.122713   nt
3 belleville 34.93358 1992  8.622351   nt
4 belleville 34.93358 1993  7.360470   nt
5 belleville 34.93358 1994        NA   nt
6     nashua 49.10000 1990  9.083390   ct
7     nashua 49.10000 1991  8.073866   nt
8     nashua 49.10000 1992  8.725625   nt
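If you would rather not depend on dplyr, the same idea can probably be expressed in base R. The following is only a sketch under a few assumptions: the `set.seed` call and `stringsAsFactors = FALSE` are additions to make the example reproducible, and `first_non_na` is a hypothetical helper name, not a function from any package.

```r
# Rebuild the question's data with a fixed seed (the seed is an addition).
set.seed(42)
txie <- data.frame(
  site  = c(rep("smithfield", 2), rep("belleville", 3)),
  yield = c(rnorm(4, mean = 8), NA),
  year  = c(1999:2000, 1992:1994),
  prim  = c(rep("nt", 2), rep(NA, 3)),
  stringsAsFactors = FALSE
)
grny <- data.frame(
  site  = c("smithfield", "belleville", rep("nashua", 3)),
  yield = c(rep(NA, 2), rnorm(3, mean = 9)),
  year  = c(rep(NA, 2), 1990:1992),
  prim  = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
  lat   = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)),
  stringsAsFactors = FALSE
)

# Outer join on site; shared columns come back as <name>.x and <name>.y.
m <- merge(txie, grny, by = "site", all = TRUE)

# Hypothetical helper: keep x where it is present, otherwise fall back to y.
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Collapse each .x/.y pair into a single column.
shared <- c("yield", "year", "prim")
for (col in shared) {
  m[[col]] <- first_non_na(m[[paste0(col, ".x")]], m[[paste0(col, ".y")]])
}

# Keep only the combined columns plus the reference column lat.
res <- m[, c("site", shared, "lat")]
res
```

The row order may differ from the dplyr result because merge sorts on the "by" column by default. The NA for belleville in 1994 is preserved, since it is missing in both data frames.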
