Approximate string matching in R between two datasets
I have the following dataset of film titles with their genres, and a second dataset of free text that may or may not mention those titles:
dt1
title                     genre
Secret in Their Eyes      Dramas
V for Vendetta            Action & Adventure
Bottersnikes & Gumbles    Kids' TV
...                       ...
and
dt2
id  Text
1   "I really liked V for Vendetta"
2   "Bottersnikes & Gumbles was a great film .... "
3   " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4   "@thewitcher was an interesting series"
5   "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc
What I want is a function that takes the titles in dt1 and tries to find them in the texts in dt2.
If it finds an exact or approximate match, I would like a column in dt2 stating which title is mentioned in the text. If more than one title is mentioned, I would like all of them, separated by commas.
dt2
id  Text                                                                      mentions
1   "I really liked V for Vendetta"                                           "V for Vendetta"
2   "Bottersnikes & Gumbles was a great film .... "                           "Bottersnikes & Gumbles"
3   " In any case, in my opinion bottersnikes &gumbles was a great film ..."  "Bottersnikes & Gumbles"
4   "@thewitcher was an interesting series"                                   NA
5   "Secret in Their Eye is a terrible film! but I Like V per Vendetta"       "Secret in Their Eyes, V for Vendetta"
... etc
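You can do fuzzy matching with agrepl(). Here I use lapply() to apply it once per title, which gives, for each title, a logical vector saying which texts match. I then use apply() across the rows of a data.frame built from those vectors to create the vector of matched titles.
You can adjust the max.distance value, but this one worked well for your example.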
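To get a feel for what max.distance does before settling on a value, it can help to try a few cut-offs on a single title and text. The sketch below is only illustrative: the title and text are taken from the example data, while the cut-offs tried are arbitrary choices, not recommendations.

```r
# Illustrative only: title and text come from the example data;
# the cut-offs below are arbitrary assumptions.
title <- "V for Vendetta"
text  <- "Secret in Their Eye is a terrible film! but I Like V per Vendetta"

# A max.distance of 1 or more is an absolute edit budget; a value below 1
# is read as a fraction of the pattern length (rounded up).
cutoffs <- c(0, 2, 0.1, 0.3)

hits <- vapply(
  cutoffs,
  function(d) agrepl(title, text, max.distance = d, ignore.case = TRUE, fixed = TRUE),
  logical(1)
)

# One row per cut-off, showing whether the title was found in the text
print(data.frame(max.distance = cutoffs, matched = hits))
```

With a budget of 0 only an exact occurrence of the title counts, so "V per Vendetta" is rejected. It is two substitutions away from the title, so a budget of 2 accepts it. Larger budgets accept more typos but also raise the risk of false positives, especially for short titles.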
dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)
dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "@thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)
match_titles <- function(target, titles) {
  # For each title: a logical vector, TRUE where the title approximately occurs in a text
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  # One row per text, one column per title; paste together the titles that matched
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}
dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
##   id                                                                  Text
## 1  1                                         I really liked V for Vendetta
## 2  2                         Bottersnikes & Gumbles was a great film ....
## 3  3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4  4                                 @thewitcher was an interesting series
## 5  5     Secret in Their Eye is a terrible film! but I Like V per Vendetta
##                                titles
## 1                      V for Vendetta
## 2              Bottersnikes & Gumbles
## 3              Bottersnikes & Gumbles
## 4
## 5 Secret in Their Eyes,V for Vendetta
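The output above has an empty string for texts with no match and a bare "," between titles, whereas the question asked for NA and a comma-separated list. One possible variation is sketched below. The name collapse_mentions and its sep argument are hypothetical, introduced here only for illustration; the agrepl() settings are the same as in the answer.

```r
# Hypothetical helper (not part of the original answer): turn a list of
# per-title logical vectors into one string per text, NA where nothing matched.
collapse_mentions <- function(matches, titles, sep = ", ") {
  # Rows are texts, columns are titles
  match_matrix <- do.call(cbind, matches)
  out <- apply(match_matrix, 1, function(y) paste(titles[y], collapse = sep))
  # paste() on zero titles gives "", which we map to NA
  out[out == ""] <- NA_character_
  out
}

titles <- c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles")
texts <- c(
  "I really liked V for Vendetta",
  "@thewitcher was an interesting series",
  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
)

# Same fuzzy-matching call as in the answer, one logical vector per title
matches <- lapply(titles, agrepl, texts,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE
)
mentions <- collapse_mentions(matches, titles)
print(mentions)
```

Building the matrix with cbind() instead of data.frame() avoids the auto-generated column names, but either works here; the only behavioural changes are the separator and the NA for unmatched texts.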