Is there a way to merge by matching a column of words to a column of sentences in R?
For example:
a<-c("This sentence has San-Francisco","This one has london","This one has newYork")
b<-c(10,20,30)
data1<-as.data.frame(cbind(a,b))
c<-c("San Francisco","London", "New York")
d<-c(100,2050,100)
data2<-as.data.frame(cbind(c,d))
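So I want to merge data1 and data2, specifically by matching column a to column c. The problem is that the city names are spelled differently, and the city name can appear at different positions within the sentence. I tried the fuzzyjoin package, but I got very few matches. Is there a way to automate this? Basically I want to get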
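You can clean the data to make things easier, here using stringr (there are many possible approaches).
What I do here is strip all punctuation, capital letters and spaces from a, then do the same for c. With the strings in a and c simplified, it is easier to extract the matches between them (my variable city) and join on that: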
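library(stringr)
library(dplyr)
library(purrr)

a <- c(
  "This sentence has San-Francisco",
  "This one has london",
  "This one has newYork",
  "Here also San Francisco"
)
b <- c(10, 20, 30, 40)
c <- c("San Francisco", "London", "New York")
d <- c(100, 2050, 100)

# Simplify a: remove spaces and punctuation, then lower-case
a_test <- str_replace_all(a, " ", "")
a_test <- str_replace_all(a_test, "[:punct:]", "")
a_test <- str_to_lower(a_test)

# Apply the same clean-up to c
c_test <- str_replace_all(c, " ", "")
c_test <- str_replace_all(c_test, "[:punct:]", "")
c_test <- str_to_lower(c_test)

# For each cleaned sentence, try every cleaned city name and keep the match.
# This assumes each sentence contains exactly one of the cities.
city <- map(a_test, str_extract, c_test) %>%
  unlist() %>%
  na.omit()

# data.frame() keeps b and d numeric (cbind() would turn them into text)
data1 <- data.frame(a, city = as.character(city), b)
data2 <- data.frame(c, c_test, d)

inner_join(data1, data2, by = c("city" = "c_test")) %>%
  dplyr::select(a, b, c, d)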
                                a  b             c    d
1 This sentence has San-Francisco 10 San Francisco  100
2             This one has london 20        London 2050
3            This one has newYork 30      New York  100
4         Here also San Francisco 40 San Francisco  100
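
A variant of the same idea, offered as a sketch rather than a definitive solution: if some sentence mentions none of the known cities, the unlist() / na.omit() step drops it and city no longer lines up with a. Collapsing the cleaned city names into one regex alternation gives exactly one result per sentence (NA when nothing matches). The helper name normalise, the column name key and the fifth sentence are made up for illustration, and the sketch assumes the cleaned city names contain no regex metacharacters.

```r
library(stringr)
library(dplyr)

# Hypothetical helper: strip punctuation and whitespace, then lower-case
normalise <- function(x) {
  str_to_lower(str_replace_all(x, "[[:punct:][:space:]]", ""))
}

data1 <- data.frame(
  a = c(
    "This sentence has San-Francisco",
    "This one has london",
    "This one has newYork",
    "Here also San Francisco",
    "No city in this one" # illustrative row with no match
  ),
  b = c(10, 20, 30, 40, 50)
)
data2 <- data.frame(
  c = c("San Francisco", "London", "New York"),
  d = c(100, 2050, 100)
)

# Cleaned city names serve as the join key
data2$key <- normalise(data2$c)

# One alternation pattern built from the keys ("sanfrancisco|london|newyork");
# str_extract() returns NA where nothing matches, so rows stay aligned
pattern <- str_c(data2$key, collapse = "|")
data1$key <- str_extract(normalise(data1$a), pattern)

# left_join() keeps sentences without a match (c and d become NA)
result <- left_join(data1, data2, by = "key") %>%
  select(a, b, c, d)
print(result)
```

Switching left_join() back to inner_join() would drop the unmatched sentence, as in the code above it.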