Full join two datasets but keep both columns I'm matching on and add a new row when it's not an exact match

Here are my two datasets:


data1 = data.frame (id =c(1,1,1,1,1,1,1,1,1),
                    drug = c(    "drug1",   "drug1",      "drug2",     "drug3",     "drug4",     "drug4",     "drug5",     "drug6",      "drug7"),
                    date_tx=c("2014-01-21","2015-04-01","2016-03-15","2013-01-13","2014-01-02","2017-04-05","2021-07-22","2022-03-01","2016-01-28"))
data2 = data.frame (id =c(1,1,1,1,1,1,1,1,1,1),
                    drug = c(    "drug1",     "drug1",      "drug2",    "drug3",   "drug4",      "drug4",      "drug5",     "drug6",      "drug7",     "drug8"),
                    date_plan=c("2014-01-23","2015-04-01","2016-03-15","2013-03-01","2014-01-02","2017-04-05","2021-07-24","2022-03-01","2016-01-20","2016-05-05"))

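I want to do a full join on id, drug, and the two dates (date_tx and date_plan). Even though I'm joining on the date, I want to keep both date columns: when the two dates don't match (as with the first pair of dates), I want two separate rows, each with its date in its own column.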

What I'm hoping to get is:

output = data.frame (id =c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                      drug = c(   "drug1",     "drug1", "drug1", "drug2",     "drug3",   "drug3",     "drug4",    "drug4",     "drug5",      "drug5",    "drug6",    "drug7", "drug7", "drug8"),
                      date_tx=c("2014-01-21","2015-04-01",NA, "2016-03-15","2013-01-13",    NA,   "2014-01-02","2017-04-05","2021-07-22",      NA,      "2022-03-01","2016-01-28",NA,NA),
                      date_plan=c(NA,"2015-04-01","2014-01-23","2016-03-15",     NA,  "2013-03-01","2014-01-02","2017-04-05",    NA,       "2021-07-24", "2022-03-01", NA, "2016-01-20","2016-05-05"))

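What I've tried: the following does give the right number of rows, but for the rows that don't match I need to be able to tell which date column each value came from.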

merge <- full_join(data1, data2, by=c("id"="id", "drug"="drug", "date_tx"="date_plan"))

Any help would be greatly appreciated!

Would creating a copy of the date column, named date_merge, in each data frame before joining give you the result you need?

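library(dplyr)

data1 %>% 
  # copy each date into a shared key so both original columns survive the join
  mutate(date_merge = date_tx) %>% 
  full_join(data2 %>% 
              mutate(date_merge = date_plan), 
            by = c("id", "drug", "date_merge")) %>% 
  # the helper key is no longer needed once the rows are matched
  select(-date_merge) %>% 
  arrange(id, drug)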
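
If dplyr isn't available, the same helper-column idea can probably be done in base R with merge(..., all = TRUE). This is only a minimal sketch on the example data from the question; the name res is just illustrative, it adds a date_merge column to data1 and data2 in place, and the row order within a drug may differ from the expected output.

```r
# Example data from the question (dates kept as character strings)
data1 <- data.frame(
  id = rep(1, 9),
  drug = c("drug1", "drug1", "drug2", "drug3", "drug4", "drug4",
           "drug5", "drug6", "drug7"),
  date_tx = c("2014-01-21", "2015-04-01", "2016-03-15", "2013-01-13",
              "2014-01-02", "2017-04-05", "2021-07-22", "2022-03-01",
              "2016-01-28"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  id = rep(1, 10),
  drug = c("drug1", "drug1", "drug2", "drug3", "drug4", "drug4",
           "drug5", "drug6", "drug7", "drug8"),
  date_plan = c("2014-01-23", "2015-04-01", "2016-03-15", "2013-03-01",
                "2014-01-02", "2017-04-05", "2021-07-24", "2022-03-01",
                "2016-01-20", "2016-05-05"),
  stringsAsFactors = FALSE
)

# Copy each date into a shared helper key so the original columns survive the join
data1$date_merge <- data1$date_tx
data2$date_merge <- data2$date_plan

# all = TRUE makes merge() behave like a full (outer) join
res <- merge(data1, data2, by = c("id", "drug", "date_merge"), all = TRUE)

# Drop the helper key; date_tx / date_plan are NA where the other side had no match
res$date_merge <- NULL
res <- res[order(res$id, res$drug), ]
rownames(res) <- NULL
print(res)
```

Because the helper key is dropped after the join, a row where both dates are filled in means the two dates were identical, and a row with one NA means that date had no exact match on the other side.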

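You could try the powerjoin package by @moodymudskipper. You can do a full join and specify keep = "both" to keep both of the columns you are interested in. The conflict argument, set to coalesce_xy, reconciles the columns that share a name in the two data frames. I added arrange and select at the end so the final result matches the output in the post.
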
library(dplyr)      # for %>%, arrange() and select()
library(powerjoin)

power_full_join(
  data1,
  data2,
  by = c("id", "drug", "date_tx" = "date_plan"),
  keep = "both",
  conflict = coalesce_xy
) %>%
  arrange(id, drug) %>%
  select(id, drug, date_tx, date_plan)

Output

   id  drug    date_tx  date_plan
1   1 drug1 2014-01-21       <NA>
2   1 drug1 2015-04-01 2015-04-01
3   1 drug1       <NA> 2014-01-23
4   1 drug2 2016-03-15 2016-03-15
5   1 drug3 2013-01-13       <NA>
6   1 drug3       <NA> 2013-03-01
7   1 drug4 2014-01-02 2014-01-02
8   1 drug4 2017-04-05 2017-04-05
9   1 drug5 2021-07-22       <NA>
10  1 drug5       <NA> 2021-07-24
11  1 drug6 2022-03-01 2022-03-01
12  1 drug7 2016-01-28       <NA>
13  1 drug7       <NA> 2016-01-20
14  1 drug8       <NA> 2016-05-05
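
For comparison, plain dplyr may get to the same place without an extra package: in recent dplyr versions, full_join() accepts keep = TRUE, which retains the join keys from both tables. Keys that share a name then come back with .x / .y suffixes, so this sketch coalesces them by hand. It assumes the example data from the question, and res is just an illustrative name.

```r
library(dplyr)

# Example data from the question (dates kept as character strings)
data1 <- data.frame(
  id = rep(1, 9),
  drug = c("drug1", "drug1", "drug2", "drug3", "drug4", "drug4",
           "drug5", "drug6", "drug7"),
  date_tx = c("2014-01-21", "2015-04-01", "2016-03-15", "2013-01-13",
              "2014-01-02", "2017-04-05", "2021-07-22", "2022-03-01",
              "2016-01-28"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  id = rep(1, 10),
  drug = c("drug1", "drug1", "drug2", "drug3", "drug4", "drug4",
           "drug5", "drug6", "drug7", "drug8"),
  date_plan = c("2014-01-23", "2015-04-01", "2016-03-15", "2013-03-01",
                "2014-01-02", "2017-04-05", "2021-07-24", "2022-03-01",
                "2016-01-20", "2016-05-05"),
  stringsAsFactors = FALSE
)

res <- full_join(data1, data2,
                 by = c("id", "drug", "date_tx" = "date_plan"),
                 keep = TRUE) %>%
  # keep = TRUE returns id.x/id.y and drug.x/drug.y; unmatched rows are NA
  # on one side, so take whichever side is present
  mutate(id = coalesce(id.x, id.y),
         drug = coalesce(drug.x, drug.y)) %>%
  select(id, drug, date_tx, date_plan) %>%
  arrange(id, drug)

print(res)
```

This does by hand roughly what keep = "both" plus conflict = coalesce_xy do in the powerjoin call above, at the cost of naming each shared key column explicitly.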