Join two data frames on one column that contains a substring of the other
I am trying to left join df2 onto df1. df1 is my data frame of interest; df2 contains additional information I need.
Example:
#df of interest onto which the other should be joined
key1 <- c("London", "Paris", "Berlin", "Delhi")
other_stuff <- c("Tea", "Coffee", "Beer", "Tea")
df1 <- data.frame(key1, other_stuff)
#additional info df
key2 <- c("London and other cities", "some other city", "Eastberlin is history", "Berlin", "Delia is a name", "Delhi is a place")
more_info <- c("history", "languages", "trades", "art", "commerce", "manufacturing")
df2 <- data.frame(key2,more_info)
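What I want is to search df2$key2 for exact occurrences of df1$key1 and then merge the result onto df1 (for example, match Berlin to Berlin but not to Eastberlin, and Delhi to Delhi but not to Delia), while ignoring the other words around the match.
Desired result: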
key1 | other_stuff | more_info |
---|---|---|
London | Tea | history |
Paris | Coffee | NA |
Berlin | Beer | art |
Delhi | Tea | manufacturing |
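I tried variants of regex_left_join
joined <- regex_left_join(df1, df2, by = c("key1" = "key2"), ignore_case = FALSE)
and fuzzy_left_join
joined <- df1 %>% fuzzy_left_join(df2, by = c("key1" = "key2"), match_fun = str_detect)
Both return a result only for the exact match (key1 = key2 = Berlin) and NA for everything else.
How can I do this?
I also tried an SQL approach, but the logic in SQL was wrong. I tried several other Stack Exchange approaches as well, but they were "too fuzzy" for my data.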
The following works for the posted example data, but it uses two joins and may be inefficient for larger data sets.
library(dplyr)
library(fuzzyjoin)
left_join(
  df1,
  regex_left_join(df2, df1, by = c(key2 = "key1"))[c(3, 4, 2)] |> na.omit()
)
#> Joining, by = c("key1", "other_stuff")
#> key1 other_stuff more_info
#> 1 London Tea history
#> 2 Paris Coffee <NA>
#> 3 Berlin Beer art
#> 4 Delhi Tea manufacturing
Created on 2022-02-16 by the reprex package (v2.0.1)
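Here I use a "regular" dplyr::left_join, but do some selection on df2 while joining it with df1.
First, create a vector containing the target cities. Then split df2$key2 on spaces and check whether any word matches a string in the vector city. Finally, left_join the result with df1.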
library(tidyverse)
city <- c("London", "Paris", "Berlin", "Delhi")
left_join(df1,
          df2 %>% mutate(city = sapply(strsplit(df2$key2, " "),
                                       function(x) first(intersect(city, x)))),
          by = c("key1" = "city")) %>%
  select(-key2)
key1 other_stuff more_info
1 London Tea history
2 Paris Coffee <NA>
3 Berlin Beer art
4 Delhi Tea manufacturing
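The same whole-word idea can also be sketched with regex word boundaries in base R, without splitting the strings. This is only a minimal sketch under some assumptions: the names patterns, first_match and result are made up for illustration, the keys are assumed to contain no regex metacharacters, and when several rows of df2 match the same key only the first one is kept.

```r
# Example data from the question
df1 <- data.frame(
  key1 = c("London", "Paris", "Berlin", "Delhi"),
  other_stuff = c("Tea", "Coffee", "Beer", "Tea")
)
df2 <- data.frame(
  key2 = c("London and other cities", "some other city",
           "Eastberlin is history", "Berlin",
           "Delia is a name", "Delhi is a place"),
  more_info = c("history", "languages", "trades",
                "art", "commerce", "manufacturing")
)

# Wrap each key in \b so it only matches as a whole word
# (assumption: keys contain no regex metacharacters)
patterns <- paste0("\\b", df1$key1, "\\b")

# For each pattern, the index of the first matching row in df2 (NA if none)
first_match <- vapply(
  patterns,
  function(p) which(grepl(p, df2$key2, perl = TRUE))[1],
  integer(1)
)

# Look up more_info by row index; an NA index yields NA
result <- df1
result$more_info <- as.character(df2$more_info)[first_match]
print(result)
```

Unlike a plain substring match, the word boundary would also reject a string such as "EastBerlin is history", where the capitalisation happens to line up with the key.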
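You did not get the expected result because these functions use the second data frame's column as the regex pattern. You can therefore use regex_right_join or fuzzy_right_join with df2 as the first argument: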
df1 %>%
regex_right_join(df2, ., by = c(key2 = "key1")) %>%
select(key1, other_stuff, more_info)
df1 %>%
fuzzy_right_join(df2, ., by = c(key2 = "key1"), match_fun = str_detect) %>%
select(key1, other_stuff, more_info)
Output
key1 other_stuff more_info
1 London Tea history
2 Paris Coffee <NA>
3 Berlin Beer art
4 Delhi Tea manufacturing
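To see why the argument order matters, here is a small sketch using stringr::str_detect, the match function from the question. It assumes, as described above, that the join compares the first data frame's column (the string) against the second data frame's column (the pattern). The variable names are only illustrative.

```r
library(stringr)

# str_detect(string, pattern): the first argument is the text to search,
# the second is the regular expression to look for.

# Order used in the question: key1 is the string, key2 is the pattern
key1_as_string <- str_detect("London", "London and other cities")
print(key1_as_string)   # FALSE: the long phrase does not occur inside "London"

# Swapped order: key2 is the string, key1 is the pattern
key2_as_string <- str_detect("London and other cities", "London")
print(key2_as_string)   # TRUE: "London" occurs inside the longer phrase

# Only identical strings match in both directions
print(str_detect("Berlin", "Berlin"))   # TRUE
```

This would explain why the original attempts only matched the row where key1 and key2 were both exactly "Berlin".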