Join two data frames by checking whether the values of paired columns match the values of the paired columns in the other data frame

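Let me give an example to explain my question. I have two data frames, df1 and df2. I want to left join them, subject to two conditions:

(1) week is the same
(2) m1 and m2 in df1 match m1 and m2 in df2, ignoring the column names (the pair may appear in either order)

So the expected output is df3.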


 df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))

df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2)) 

print(df1)
      m1     m2 week
1 100010 100020    1
2 100010 100020    2
3 100010 100020    3
4 100020 100010    1
5 100020 100010    1
6 100020 100010    3
print(df2)
      m1     m2 week freq
1 100010 100020    1    3
2 100010 100020    2    1
3 100010 100020    3    2
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
print(df3)
      m1     m2 week freq
1 100010 100020    1    3
2 100010 100020    2    1
3 100010 100020    3    2
4 100020 100010    1    3
5 100020 100010    1    3
6 100020 100010    3    2

I tried merging on each ordering separately, but that created duplicate freq columns, which I don't want. Is there anything else I can try? Many thanks!

If we want an OR join, we can use regex_left_join from fuzzyjoin:

library(dplyr)
library(fuzzyjoin)
library(stringr)
regex_left_join(df1 %>% 
   mutate(m1m2 = str_c(m1, m2, sep = "|")), 
   df2 %>%
    mutate(m1m2 = str_c(m1, m2, sep = "|"), .keep = "unused"), 
     by = c("m1m2", "week")) %>% 
   select(m1, m2, week = week.x, freq)

-output

      m1     m2 week freq
1 100010 100020    1    3
2 100010 100020    2    1
3 100010 100020    3    2
4 100020 100010    1    3
5 100020 100010    1    3
6 100020 100010    3    2

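I thought I'd suggest two approaches, depending on your preference. The first uses SQL rather than R to do the task. It is a bit more straightforward for the type of join you describe.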

library(sqldf)
library(dplyr)

df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2)) 
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))

df_sql <- 
  sqldf::sqldf("SELECT a.*, b.freq
               FROM df1 a
               LEFT JOIN df2 b 
               ON (a.week = b.week and a.m1 = b.m1 and a.m2 = b.m2) OR
                  (a.week = b.week and a.m1 = b.m2 and a.m2 = b.m1)")

identical(df_sql, df3)
#> [1] TRUE

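I'm sure there are more elegant ways to do this, but the second strategy is simply to duplicate df2, swap the names of the m1 and m2 columns in the copy, stack it with the original, and then join.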

df <-
  df2 %>%
  rename(m2 = m1, m1 = m2) %>%
  bind_rows(df2, .) %>%
  left_join(df1, ., by = c("week", "m1", "m2"))


identical(df, df3)
#> [1] TRUE

I imagine there are other ways that don't involve joins, but this is how I'd do it with joins.

Created on 2022-02-17 by the reprex package (v2.0.1)
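
As one illustration of an approach that avoids a join altogether, here is a minimal base R sketch. It is not from the answers above: it builds an order-independent key for each (m1, m2, week) row and looks up freq with match(). The helper name pair_key and the result name df_lookup are made up for this example, and the sketch assumes df2 has at most one row per unordered pair and week, because match() only returns the first hit.

```r
# Data from the question
df1 <- data.frame(
  m1   = c("100010", "100010", "100010", "100020", "100020", "100020"),
  m2   = c("100020", "100020", "100020", "100010", "100010", "100010"),
  week = c(1, 2, 3, 1, 1, 3)
)
df2 <- data.frame(
  m1   = c("100010", "100010", "100010"),
  m2   = c("100020", "100020", "100020"),
  week = c(1, 2, 3),
  freq = c(3, 1, 2)
)

# Hypothetical helper: put the smaller id first so that (a, b) and (b, a)
# produce the same key, then append the week.
pair_key <- function(m1, m2, week) {
  paste(pmin(m1, m2), pmax(m1, m2), week, sep = "_")
}

key1 <- pair_key(df1$m1, df1$m2, df1$week)
key2 <- pair_key(df2$m1, df2$m2, df2$week)

# match() gives, for each row of df1, the position of the first matching
# key in df2 (NA if there is none), which mimics a left join on the key.
# Assumption: keys in df2 are unique.
df_lookup <- df1
df_lookup$freq <- df2$freq[match(key1, key2)]

print(df_lookup)
```

Rows of df1 with no counterpart in df2 would get NA for freq, as in a left join. If df2 could contain several rows for the same pair and week, a real join (as in the answers above) would be the safer choice, since this lookup silently keeps only the first match.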