Join two dataframes on one column that contains a substring of the other

I am trying to left join df2 onto df1.

df1 is the dataframe I am interested in; df2 contains additional information that I need.

Example:

#df of interest onto which the other should be joined
key1 <- c("London", "Paris", "Berlin", "Delhi") 
other_stuff <- c("Tea", "Coffee", "Beer", "Tea") 
df1 <- data.frame(key1, other_stuff)

#additional info df
key2 <- c("London and other cities", "some other city", "Eastberlin is history", "Berlin", "Delia is a name", "Delhi is a place") 
more_info <- c("history", "languages", "trades", "art", "commerce", "manufacturing")
df2 <- data.frame(key2,more_info)

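What I want is to search df2$key2 for exact occurrences of df1$key1 and then merge the result onto df1 (e.g. match Berlin to Berlin but not to Eastberlin, and Delhi to Delhi but not to Delia), ignoring any other words around the match.

Desired result: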

key1    other_stuff  more_info
London  Tea          history
Paris   Coffee       NA
Berlin  Beer         art
Delhi   Tea          manufacturing

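I tried variants of regex_left_join:

joined <- regex_left_join(df1, df2, by = c("key1" = "key2"), ignore_case = FALSE)

and a fuzzy join:

joined <- df1 %>% fuzzy_left_join(df2, by = c("key1" = "key2"), match_fun = str_detect)

Both only return a result for the exact match (key1 = key2 = Berlin) and give NA for everything else.

How can I do this?

I have also tried an SQL-based approach, but the logic in the SQL was wrong. I tried several other Stack Exchange approaches as well, but they were "too fuzzy" for my data.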

The following works for the posted example data, but it uses two joins and may be inefficient on larger datasets.

library(dplyr)
library(fuzzyjoin)

left_join(
  df1,
  regex_left_join(df2, df1, by = c(key2 = "key1"))[c(3, 4, 2)] |> na.omit()
)
#> Joining, by = c("key1", "other_stuff")
#>     key1 other_stuff     more_info
#> 1 London         Tea       history
#> 2  Paris      Coffee          <NA>
#> 3 Berlin        Beer           art
#> 4  Delhi         Tea manufacturing

Created on 2022-02-16 by the reprex package (v2.0.1)
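
Note that the answer above excludes "Eastberlin" only because the regex is case-sensitive; a value such as "EastBerlin" would still match. As a minimal sketch of one way to make the whole-word requirement explicit and avoid the second join, the base R code below wraps each key in `\\b` word boundaries and looks up the first matching row of df2. It assumes that key1 contains no regex metacharacters and that taking the first match per key is acceptable; the names `first_match` and `result` are only illustrative.

```r
# Sample data from the question
df1 <- data.frame(
  key1 = c("London", "Paris", "Berlin", "Delhi"),
  other_stuff = c("Tea", "Coffee", "Beer", "Tea")
)
df2 <- data.frame(
  key2 = c("London and other cities", "some other city",
           "Eastberlin is history", "Berlin",
           "Delia is a name", "Delhi is a place"),
  more_info = c("history", "languages", "trades",
                "art", "commerce", "manufacturing")
)

# For each key1, find the index of the first df2 row whose key2
# contains it as a whole word (assumption: keys are plain words
# without regex metacharacters).
first_match <- vapply(df1$key1, function(k) {
  hits <- grep(paste0("\\b", k, "\\b"), df2$key2, perl = TRUE)
  if (length(hits) > 0) hits[1] else NA_integer_
}, integer(1))

# Attach more_info by position; unmatched keys get NA.
result <- cbind(df1, more_info = df2$more_info[first_match])
print(result)
```

If a key can legitimately match several rows of df2, this sketch silently keeps only the first one, so a join-based approach may still be the better fit in that case.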

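Here I use a "regular" dplyr::left_join, but do some preprocessing on df2 before joining it with df1.

First, create a vector containing the target cities. Then split df2$key2 on whitespace and check whether any of the resulting words matches a string in the vector city. Finally, left_join the result with df1.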

library(tidyverse)

city <- c("London", "Paris", "Berlin", "Delhi")

left_join(df1,
          df2 %>% mutate(city = sapply(strsplit(df2$key2, " "), 
                                       function(x) first(intersect(city, x)))),
          by = c("key1" = "city")) %>% 
  select(-key2)

    key1 other_stuff     more_info
1 London         Tea       history
2  Paris      Coffee          <NA>
3 Berlin        Beer           art
4  Delhi         Tea manufacturing
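
To make the split-and-intersect step above easier to follow, here is a small base R sketch of what happens for individual strings. The helper `match_city` is hypothetical and not part of the answer; it also splits on any non-alphanumeric run rather than a single space, which is my assumption for handling text such as "Berlin, Germany" that does not occur in the posted data.

```r
city <- c("London", "Paris", "Berlin", "Delhi")

# Return the first target city that appears as a whole token, or NA.
# `split` is a regex; the default breaks on runs of non-alphanumerics.
match_city <- function(text, targets, split = "[^[:alnum:]]+") {
  tokens <- strsplit(text, split)[[1]]
  hit <- intersect(targets, tokens)
  if (length(hit) > 0) hit[1] else NA_character_
}

matched <- vapply(
  c("London and other cities", "Eastberlin is history", "Berlin, Germany"),
  match_city, character(1), targets = city, USE.NAMES = FALSE
)
print(matched)
```

Because whole tokens are compared, "Eastberlin" never equals "Berlin", which is why this approach does not need a regex to avoid partial matches.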

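You did not get the expected result because these functions use the second dataframe's column as the regex pattern. You can therefore use regex_right_join or fuzzy_right_join with df2 as the first table: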

df1 %>% 
  regex_right_join(df2, ., by = c(key2 = "key1")) %>% 
  select(key1, other_stuff, more_info)

df1 %>% 
  fuzzy_right_join(df2, ., by = c(key2 = "key1"), match_fun = str_detect) %>% 
  select(key1, other_stuff, more_info)

Output

    key1 other_stuff     more_info
1 London         Tea       history
2  Paris      Coffee          <NA>
3 Berlin        Beer           art
4  Delhi         Tea manufacturing
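
The effect of the argument order can be checked on single values. The sketch below uses a base R stand-in named `detect` (my own helper, written to mirror the string-then-pattern argument order of stringr::str_detect) so that it runs without extra packages.

```r
# Stand-in for stringr::str_detect(string, pattern), in base R
detect <- function(string, pattern) grepl(pattern, string)

# Order used in the question: key1 is the string, key2 is the pattern.
# The long key2 text is not found inside the short key1 value.
q_order <- detect("London", "London and other cities")

# Order used in this answer: key2 is the string, key1 is the pattern.
a_order <- detect("London and other cities", "London")

# Identical values match in either order, which is why only
# Berlin = Berlin came back in the question.
exact <- detect("Berlin", "Berlin")

print(c(q_order = q_order, a_order = a_order, exact = exact))
```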