How to join 2 tables on the maximum matched string in R?
I want to join Table 1 and Table 2 (ColB on the left, ColD on the right) on the maximum matched string.
Table 1
ColA | ColB |
---|---|
123 | C/O room Hanbur court vaux road |
456 | House Malveri business park |
Table 2
ColD | ColC |
---|---|
Hanbur Court | Lightroom |
Malveri park | Office |
Output table
ColA | ColB | ColC |
---|---|---|
123 | C/O room Hanbur court vaux road | Lightroom |
456 | House Malveri business park | Office |
With fuzzyjoin, you can join on a string distance:
library(fuzzyjoin)
library(dplyr)
stringdist_inner_join(df1, df2, by = c(ColB = "ColD"),
                      max_dist = 0.5, method = "jaccard") %>%
  select(-ColD)
ColA ColB ColC
1 123 C/O room Hanbur court vaux road Lightroom
2 456 House Malveri business park Office
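To see why a threshold of 0.5 separates the intended pairs here, the distance can be recomputed by hand. The following is a minimal base R sketch, assuming that `method = "jaccard"` with the default `q = 1` compares the sets of single characters in each string; `char_jaccard` is a hypothetical helper written for illustration, not a function from fuzzyjoin or stringdist.

```r
# Jaccard distance between the sets of single characters of two strings.
# Assumption: this mirrors method = "jaccard" with q = 1 (case-sensitive).
char_jaccard <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])
  sb <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

left  <- c("C/O room Hanbur court vaux road", "House Malveri business park")
right <- c("Hanbur Court", "Malveri park")

# Distance matrix: rows are ColB values, columns are ColD values
d <- outer(left, right, Vectorize(char_jaccard))
dimnames(d) <- list(left, right)

# Pairs that would survive max_dist = 0.5
d <= 0.5
```

Under that assumption only the two intended pairs fall at or below 0.5. The margin is small, so the threshold will likely need adjusting for other data.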
Data
df1 <- structure(list(ColA = c(123L, 456L),
                      ColB = c("C/O room Hanbur court vaux road",
                               "House Malveri business park")),
                 class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(ColD = c("Hanbur Court", "Malveri park"),
                      ColC = c("Lightroom", "Office")),
                 class = "data.frame", row.names = c(NA, -2L))
This one is fairly complicated, but it gets the job done:
library(dplyr)
library(stringr)
library(tidyr)
# prepare df2 to get pattern for `str_detect` later:
# one row per word of ColD, lower-cased
df2_new <- df2 %>%
  separate_rows(ColD, sep = " ") %>%
  mutate(helper = tolower(ColD))
# create pattern to match (all words of ColD, joined with "|")
pattern <- paste(df2_new$helper, collapse = "|")
# do the calculations
df1 %>%
  # one row per word of ColB
  separate_rows(ColB, sep = " ") %>%
  # flag the words that match any word from ColD
  mutate(helper = tolower(ColB),
         helper1 = ifelse(str_detect(helper, pattern), 1, 0)) %>%
  group_by(ColA) %>%
  # rebuild ColB and keep only the matched words as the join key
  mutate(helper = paste(helper[helper1 == 1], collapse = " "),
         ColB = paste(ColB, collapse = " "), .keep = "unused") %>%
  slice(1) %>%
  # join on the lower-cased ColD
  right_join(df2 %>%
               mutate(helper = tolower(ColD)), by = "helper") %>%
  select(ColA, ColB, ColC)
ColA ColB ColC
<int> <chr> <chr>
1 123 C/O room Hanbur court vaux road Lightroom
2 456 House Malveri business park Office
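The same "most shared words wins" idea can also be sketched in base R without the helper columns. This is only an illustration, not the answer's method: `words` and `shared_words` are hypothetical helpers, and splitting on spaces with case-insensitive whole-word matching is an assumption about what "maximum matched string" should mean.

```r
df1 <- data.frame(ColA = c(123L, 456L),
                  ColB = c("C/O room Hanbur court vaux road",
                           "House Malveri business park"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColD = c("Hanbur Court", "Malveri park"),
                  ColC = c("Lightroom", "Office"),
                  stringsAsFactors = FALSE)

# Split a string into lower-cased, unique words
words <- function(x) unique(strsplit(tolower(x), " +")[[1]])

# Number of words two strings have in common
shared_words <- function(a, b) length(intersect(words(a), words(b)))

# For each ColB value, index of the ColD row sharing the most words
best <- vapply(df1$ColB, function(b) {
  counts <- vapply(df2$ColD, shared_words, integer(1), b = b,
                   USE.NAMES = FALSE)
  which.max(counts)
}, integer(1), USE.NAMES = FALSE)

result <- data.frame(ColA = df1$ColA, ColB = df1$ColB,
                     ColC = df2$ColC[best], stringsAsFactors = FALSE)
result
```

Note that `which.max` returns the first row on ties, so a ColB value with no shared words at all would still be paired with the first row of df2. Real data would probably need a minimum overlap check before accepting a match.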