Merge two dataframes: specifically merge a selection of columns based on two conditions?

Question

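I have two datasets on the same 2 patients. I want to use the second dataset to add new information to the first, but I can't seem to get the code right.

My first (incomplete) dataset has the patient ID, measurement time (T0 or FU1), year of birth, date of the CT scan, and two outcomes (legs_mass and total_mass):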

library(tidyverse)
library(dplyr)
library(magrittr)
library(lubridate)

df1 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L, 
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0", 
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961, 
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"), 
    legs_mass = c(9.1, NA, NA, NA), total_mass = c(14.5, NA, 
    NA, NA)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", 
"data.frame"))

# Which gives the following dataframe
df1

# A tibble: 4 x 6
     ID time  year_of_birth date_ct    legs_mass total_mass
  <dbl> <fct>         <dbl> <date>         <dbl>      <dbl>
1   115 T0             1970 2015-08-04       9.1       14.5
2   115 FU1            1970 2018-11-07      NA         NA  
3   370 T0             1961 2015-08-04      NA         NA  
4   370 FU1            1961 2020-09-30      NA         NA

The second dataset supplies the missing values for the legs_mass and total_mass columns:

df2 <- structure(list(ID = c(115, 370), date_ct = structure(c(17842, 
18535), class = "Date"), ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif", 
"PXE370_CT_20200930_xxxxx-403.tif"), legs_mass = c(956.1, 21.3
), total_mass = c(1015.9, 21.3)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

# Which gives the following dataframe:
df2

# A tibble: 2 x 5
     ID date_ct    ctscan_label                     legs_mass total_mass
  <dbl> <date>     <chr>                                <dbl>      <dbl>
1   115 2018-11-07 PXE115_CT_20181107_xxxxx-3.tif       956.      1016. 
2   370 2020-09-30 PXE370_CT_20200930_xxxxx-403.tif      21.3       21.3

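What I would like to do is:

1. Add the new information from df2 to the legs_mass and total_mass columns of df1, based on ID number and date_ct.
2. Add the new columns of df2 (the ones that are not in df1; ctscan_label) to df1, also based on the date of the CT and the patient ID.

The final dataset df3 should then look as follows: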

df3 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L, 
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0", 
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961, 
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"), 
    legs_mass = c(9.1, 956.1, NA, 21.3), total_mass = c(14.5, 
    1015.9, NA, 21.3)), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"))

# Corresponding to the following tibble:
# A tibble: 4 x 6
     ID time  year_of_birth date_ct    legs_mass total_mass
  <dbl> <fct>         <dbl> <date>         <dbl>      <dbl>
1   115 T0             1970 2015-08-04       9.1       14.5
2   115 FU1            1970 2018-11-07     956.      1016. 
3   370 T0             1961 2015-08-04      NA         NA  
4   370 FU1            1961 2020-09-30      21.3       21.3

I have tried base R's merge function and rbind, as well as dplyr's bind_rows, but I can't seem to get it right.

Any help?

Answer 1

You can join the two datasets and use coalesce to keep the first non-NA value from each pair of columns.
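As a minimal illustration of what coalesce does on its own: it goes through its arguments element by element and returns the first value that is not NA. The vectors x and y below are made up for this sketch to mirror the legs_mass values in df1 and df2; they are not produced by the join itself.

```r
library(dplyr)

# Hypothetical vectors standing in for legs_mass from df1 (x) and df2 (y)
x <- c(9.1, NA, NA, NA)
y <- c(NA, 956.1, NA, 21.3)

# For each position, take x if it is not NA, otherwise fall back to y
filled <- coalesce(x, y)
filled
```

Positions where both inputs are NA (the third one here) stay NA.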

library(dplyr)

# Columns present in both data frames get the suffixes .x (df1) and .y (df2)
left_join(df1, df2, by = c("ID", "date_ct")) %>%
  mutate(legs_mass = coalesce(legs_mass.x, legs_mass.y), 
         total_mass = coalesce(total_mass.x, total_mass.y)) %>%
  # Drop the suffixed helper columns; ctscan_label is dropped to match df3
  select(-matches('\\.x$|\\.y$'), -ctscan_label)

#     ID time  year_of_birth date_ct    legs_mass total_mass
#  <dbl> <fct>         <dbl> <date>         <dbl>      <dbl>
#1   115 T0             1970 2015-08-04       9.1       14.5
#2   115 FU1            1970 2018-11-07     956.      1016. 
#3   370 T0             1961 2015-08-04      NA         NA  
#4   370 FU1            1961 2020-09-30      21.3       21.3
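The question also asks for ctscan_label to be added to df1, although the expected df3 does not contain it. If that column should be kept, a variant of the join above could drop only the suffixed columns. This is a sketch under assumptions: the data is re-created compactly with tibble() (the unused factor levels of time are omitted), and df3_with_label is a name introduced here for illustration.

```r
library(dplyr)

# Compact re-creation of the question's data (unused levels of `time` omitted)
df1 <- tibble(
  ID = c(115, 115, 370, 370),
  time = factor(c("T0", "FU1", "T0", "FU1"), levels = c("T0", "FU1")),
  year_of_birth = c(1970, 1970, 1961, 1961),
  date_ct = as.Date(c("2015-08-04", "2018-11-07", "2015-08-04", "2020-09-30")),
  legs_mass = c(9.1, NA, NA, NA),
  total_mass = c(14.5, NA, NA, NA)
)
df2 <- tibble(
  ID = c(115, 370),
  date_ct = as.Date(c("2018-11-07", "2020-09-30")),
  ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
                   "PXE370_CT_20200930_xxxxx-403.tif"),
  legs_mass = c(956.1, 21.3),
  total_mass = c(1015.9, 21.3)
)

# Same join and coalesce step, but ctscan_label is kept:
# only the .x/.y helper columns are removed
df3_with_label <- left_join(df1, df2, by = c("ID", "date_ct")) %>%
  mutate(legs_mass = coalesce(legs_mass.x, legs_mass.y),
         total_mass = coalesce(total_mass.x, total_mass.y)) %>%
  select(-ends_with(".x"), -ends_with(".y"))

df3_with_label
```

Rows of df1 without a match in df2 (the two T0 rows) get NA in ctscan_label.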

Answer 2

We can use a data.table approach: an update join on ID and date_ct, with fcoalesce filling the NA values.

library(data.table)
setDT(df1)[setDT(df2), c("legs_mass", "total_mass") := 
   .(fcoalesce(legs_mass, i.legs_mass), 
     fcoalesce(total_mass, i.total_mass)), on = .(ID, date_ct)]

Output:

df1
    ID time year_of_birth    date_ct legs_mass total_mass
1: 115   T0          1970 2015-08-04       9.1       14.5
2: 115  FU1          1970 2018-11-07     956.1     1015.9
3: 370   T0          1961 2015-08-04        NA         NA
4: 370  FU1          1961 2020-09-30      21.3       21.3
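Note that setDT and := both work by reference, so the code above changes df1 itself. If the original df1 should stay untouched, one option is to work on copies made with as.data.table(). The following is a sketch, not part of the original answer: dt1 and dt2 are names introduced for illustration, and the data is re-created compactly as plain data frames.

```r
library(data.table)

# Compact re-creation of the question's data (unused levels of `time` omitted)
df1 <- data.frame(
  ID = c(115, 115, 370, 370),
  time = factor(c("T0", "FU1", "T0", "FU1"), levels = c("T0", "FU1")),
  year_of_birth = c(1970, 1970, 1961, 1961),
  date_ct = as.Date(c("2015-08-04", "2018-11-07", "2015-08-04", "2020-09-30")),
  legs_mass = c(9.1, NA, NA, NA),
  total_mass = c(14.5, NA, NA, NA)
)
df2 <- data.frame(
  ID = c(115, 370),
  date_ct = as.Date(c("2018-11-07", "2020-09-30")),
  ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
                   "PXE370_CT_20200930_xxxxx-403.tif"),
  legs_mass = c(956.1, 21.3),
  total_mass = c(1015.9, 21.3)
)

# as.data.table() returns copies, so df1 and df2 are left unchanged
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# Update join: for rows of dt1 that match dt2 on ID and date_ct,
# fill NA values from dt2's columns (referred to with the i. prefix)
dt1[dt2, c("legs_mass", "total_mass") :=
      .(fcoalesce(legs_mass, i.legs_mass),
        fcoalesce(total_mass, i.total_mass)),
    on = .(ID, date_ct)]

dt1
```

After this runs, dt1 holds the filled values while df1 still has its NA entries.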
