Full join two datasets but keep both columns I'm matching on and add a new row when it's not an exact match
Here are my two datasets:
data1 = data.frame (id =c(1,1,1,1,1,1,1,1,1),
drug = c( "drug1", "drug1", "drug2", "drug3", "drug4", "drug4", "drug5", "drug6", "drug7"),
date_tx=c("2014-01-21","2015-04-01","2016-03-15","2013-01-13","2014-01-02","2017-04-05","2021-07-22","2022-03-01","2016-01-28"))
data2 = data.frame (id =c(1,1,1,1,1,1,1,1,1,1),
drug = c( "drug1", "drug1", "drug2", "drug3", "drug4", "drug4", "drug5", "drug6", "drug7", "drug8"),
date_plan=c("2014-01-23","2015-04-01","2016-03-15","2013-03-01","2014-01-02","2017-04-05","2021-07-24","2022-03-01","2016-01-20","2016-05-05"))
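I want to do a full join using id, drug, and the two dates (date_tx and date_plan). Even though I'm joining on the date, I want to keep both date columns: when the two dates don't match (i.e. the first two dates), I want two rows, each with its date in its own column and NA in the other.
What I'd like to get is: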
output = data.frame (id =c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
drug = c( "drug1", "drug1", "drug1", "drug2", "drug3", "drug3", "drug4", "drug4", "drug5", "drug5", "drug6", "drug7", "drug7", "drug8"),
date_tx=c("2014-01-21","2015-04-01",NA, "2016-03-15","2013-01-13", NA, "2014-01-02","2017-04-05","2021-07-22", NA, "2022-03-01","2016-01-28",NA,NA),
date_plan=c(NA,"2015-04-01","2014-01-23","2016-03-15", NA, "2013-03-01","2014-01-02","2017-04-05", NA, "2021-07-24", "2022-03-01", NA, "2016-01-20","2016-05-05"))
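What I tried: the following does give the right number of rows, but for the rows that don't match I need to be able to tell which date column the value came from.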
merge <- full_join(data1, data2, by=c("id"="id", "drug"="drug", "date_tx"="date_plan"))
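Any help would be appreciated!
Would creating a copy of the date column, named date_merge, in each of the data frames being merged give you the result you need?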
library(dplyr)
# Join on a copy of each date so the original date columns both survive
data1 %>%
  mutate(date_merge = date_tx) %>%
  full_join(data2 %>%
              mutate(date_merge = date_plan),
            by = c("id", "drug", "date_merge")) %>%
  select(-date_merge) %>%
  arrange(id, drug)
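The same copy-the-key idea can also be sketched in base R with merge(), in case dplyr isn't available. This is only an illustration on a small subset of the question's rows, not something the answer above proposed; the name `merged` is made up for the example.

```r
# Subset of the question's data, kept small for illustration
data1 <- data.frame(
  id = c(1, 1, 1),
  drug = c("drug1", "drug1", "drug2"),
  date_tx = c("2014-01-21", "2015-04-01", "2016-03-15")
)
data2 <- data.frame(
  id = c(1, 1, 1, 1),
  drug = c("drug1", "drug1", "drug2", "drug8"),
  date_plan = c("2014-01-23", "2015-04-01", "2016-03-15", "2016-05-05")
)

# Copy each date into a shared key column, so the originals are not consumed by the join
data1$date_merge <- data1$date_tx
data2$date_merge <- data2$date_plan

# all = TRUE makes merge() behave like a full (outer) join
merged <- merge(data1, data2, by = c("id", "drug", "date_merge"), all = TRUE)

# Drop the helper key and sort like the answer above
merged$date_merge <- NULL
merged <- merged[order(merged$id, merged$drug), ]
print(merged)
```

Rows whose dates match keep both date_tx and date_plan; rows that don't match appear once per source, with NA in the other date column.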
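You can try the powerjoin package by @moodymudskipper. You can do a full join and specify keep = "both" to keep both of the columns you're interested in. The conflict argument with coalesce_xy resolves the columns that have the same name in the two data frames. I added arrange and select at the end so the final result matches the output in the post.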
library(powerjoin)
library(dplyr)  # for %>%, arrange() and select()
power_full_join(
  data1,
  data2,
  by = c("id", "drug", "date_tx" = "date_plan"),
  keep = "both",
  conflict = coalesce_xy
) %>%
  arrange(id, drug) %>%
  select(id, drug, date_tx, date_plan)
Output
id drug date_tx date_plan
1 1 drug1 2014-01-21 <NA>
2 1 drug1 2015-04-01 2015-04-01
3 1 drug1 <NA> 2014-01-23
4 1 drug2 2016-03-15 2016-03-15
5 1 drug3 2013-01-13 <NA>
6 1 drug3 <NA> 2013-03-01
7 1 drug4 2014-01-02 2014-01-02
8 1 drug4 2017-04-05 2017-04-05
9 1 drug5 2021-07-22 <NA>
10 1 drug5 <NA> 2021-07-24
11 1 drug6 2022-03-01 2022-03-01
12 1 drug7 2016-01-28 <NA>
13 1 drug7 <NA> 2016-01-20
14 1 drug8 <NA> 2016-05-05
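For comparison, dplyr's own full_join() also accepts keep = TRUE, which may be enough if you'd rather not add another package. The following is a sketch under a few assumptions: dplyr 1.0 or later, a small subset of the question's rows, and the default ".x"/".y" suffixes for the identically named key columns. The name `result` is made up for the example.

```r
library(dplyr)

# Subset of the question's data, kept small for illustration
data1 <- data.frame(
  id = c(1, 1, 1),
  drug = c("drug1", "drug1", "drug2"),
  date_tx = c("2014-01-21", "2015-04-01", "2016-03-15")
)
data2 <- data.frame(
  id = c(1, 1, 1, 1),
  drug = c("drug1", "drug1", "drug2", "drug8"),
  date_plan = c("2014-01-23", "2015-04-01", "2016-03-15", "2016-05-05")
)

result <- full_join(
  data1, data2,
  by = c("id", "drug", "date_tx" = "date_plan"),
  keep = TRUE  # retain the key columns from both inputs
) %>%
  # id and drug now exist once per side (assumed default suffixes); merge them back
  mutate(
    id = coalesce(id.x, id.y),
    drug = coalesce(drug.x, drug.y)
  ) %>%
  select(id, drug, date_tx, date_plan) %>%
  arrange(id, drug)

print(result)
```

The coalesce() step plays the role that conflict = coalesce_xy plays in the powerjoin call: unmatched rows have NA in the other side's id and drug, so the two copies are collapsed into one.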