Is there a way to merge by matching a column of words to a column of sentences in R

Question

For example:

a<-c("This sentence has San-Francisco","This one has london","This one has newYork")
b<-c(10,20,30)

data1<-as.data.frame(cbind(a,b))

c<-c("San Francisco","London", "New York")
d<-c(100,2050,100)

data2<-as.data.frame(cbind(c,d))

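So I want to merge data1 and data2, specifically by matching column a against column c. The problem is that the city names are spelled differently, and the city name usually appears at a different position in each sentence. I tried the fuzzyjoin package, but I got very few matches. Is there a way to automate this? Basically, I want to get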

Answer 1

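You can clean the data first to make things easier. Here I use stringr (there are many possible approaches):

What I do here is remove all punctuation and spaces from a and convert it to lower case, then do the same for c. With the strings in a and c simplified this way, it is much easier to extract the match between them (my variable city) and join on it.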

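library(stringr)
library(dplyr)
library(purrr)

a <-
  c(
    "This sentence has San-Francisco",
    "This one has london",
    "This one has newYork",
    "Here also San Francisco"
  )
# Normalize the sentences: drop spaces and punctuation, then lower-case
a_test <- str_replace_all(a, " ", "")
a_test <- str_replace_all(a_test, "[:punct:]", "")
a_test <- str_to_lower(a_test)

b <- c(10, 20, 30, 40)

c <- c("San Francisco", "London", "New York")
# Normalize the city names the same way so they can serve as join keys
c_test <- str_replace_all(c, " ", "")
c_test <- str_to_lower(c_test)

d <- c(100, 2050, 100)

# Try every city key against every sentence and keep only the hits.
# This stays aligned with a only if each sentence contains exactly one city.
city <- map(a_test, str_extract, c_test) %>%
  unlist() %>%
  na.omit()

data1 <- as.data.frame(cbind(a, city, b))

data2 <- as.data.frame(cbind(c, c_test, d))

# Join on the normalized key and keep the original columns
inner_join(data1, data2, by = c("city" = "c_test")) %>%
  dplyr::select(a, b, c, d)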
#>                                 a  b             c    d
#> 1 This sentence has San-Francisco 10 San Francisco  100
#> 2             This one has london 20        London 2050
#> 3            This one has newYork 30      New York  100
#> 4         Here also San Francisco 40 San Francisco  100
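
The normalize-then-match idea above relies on every sentence containing exactly one city; a sentence with no city (or two) would shift the city vector out of line with a. Below is a minimal base R sketch of the same idea that keeps one result per sentence and uses NA when nothing matches. The helper name normalize, the fifth example sentence, and the choice to take the first matching city are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: lower-case, then drop everything except letters and digits
normalize <- function(x) gsub("[^a-z0-9]", "", tolower(x))

data1 <- data.frame(
  a = c("This sentence has San-Francisco", "This one has london",
        "This one has newYork", "Here also San Francisco",
        "No city in this one"),            # extra row added for illustration
  b = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  c = c("San Francisco", "London", "New York"),
  d = c(100, 2050, 100),
  stringsAsFactors = FALSE
)

keys <- normalize(data2$c)
sentences <- normalize(data1$a)

# For each sentence: row index in data2 of the first city key it contains,
# or NA when no key is found (assumption: first match wins)
first_hit <- vapply(sentences, function(s) {
  hits <- which(vapply(keys, grepl, logical(1), x = s, fixed = TRUE))
  if (length(hits) > 0) hits[[1]] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

# Look up the matching data2 row for each sentence; NA index gives an NA row
result <- cbind(data1, data2[first_hit, ])
rownames(result) <- NULL
result
```

Because spaces are removed before matching, a key can match across a word boundary: "salon done" normalizes to "salondone", which contains "london". If that matters for your data, matching on word boundaries in the original text may be a safer choice.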


r

dplyr

fuzzyjoin