Fuzzy Matching/Join Two Data Frames of University Names
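I have a list of university names that contains misspellings and inconsistent input. I need to match them against an official list of university names in order to link my data.
I know fuzzy matching/joining is the way to go, but I'm a bit lost on the correct approach. Any help would be greatly appreciated.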
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
And I would like an output that merges them together as closely as possible:
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))
I use adist() for things like this, and have a small wrapper function called closest_match() to help compare a value against a set of "good/permitted" values.
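library(magrittr) # for the %>%

closest_match <- function(bad_value, good_values) {
  # Edit distance from bad_value to every permitted value, named by that value
  distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
    as.numeric() %>%
    setNames(good_values)

  # Keep the permitted value(s) at the minimum distance (ties return several)
  distances[distances == min(distances)] %>%
    names()
}

# Look up the closest official name for each name in d
sapply(d$name, function(x) closest_match(x, y$name)) %>%
  setNames(d$name)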
University of New Yorkk The University of South\n Carolina Syracuuse University
"University of New York" "University of South Carolina" "University of New York"
University of South Texas The University of No Carolina
"University of South Texas" "University of South Carolina"
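If you want the result as a data frame like the matched one in the question, one possible approach is to compute the whole distance matrix in a single adist() call and pick the closest column for each row. The sketch below is only an illustration under some assumptions: normalize_name() is a hypothetical helper (not from any package), and its rules (lowercasing, collapsing whitespace and line breaks, dropping a leading "The") are guesses that happen to suit this sample data. The data is the question's, with the line breaks inside some names written as \n.

```r
# Sample data from the question; "\n" marks the line breaks inside some names
d <- data.frame(
  name = c("University of New Yorkk", "The University of South\nCarolina",
           "Syracuuse University", "University of South Texas",
           "The University of No Carolina"),
  score = c(1, 3, 6, 10, 4),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("University of South Texas", "The University of North\nCarolina",
           "University of South Carolina", "Syracuse\nUniversity",
           "University of New York"),
  distance = c(100, 400, 200, 20, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the normalization rules are assumptions for this data
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("\\s+", " ", x)  # collapse whitespace runs, including line breaks
  x <- trimws(x)
  sub("^the ", "", x)        # drop a leading "the "
}

# One row per name in d, one column per name in y
dist_matrix <- adist(normalize_name(d$name), normalize_name(y$name))

# Column index of the closest official name (which.min takes the first on ties)
best <- apply(dist_matrix, 1, which.min)

matched <- data.frame(
  name          = d$name,
  correctmatch  = y$name[best],
  edit_distance = dist_matrix[cbind(seq_along(best), best)],
  score         = d$score,
  distance      = y$distance[best],
  stringsAsFactors = FALSE
)
matched
```

Keeping edit_distance in the result makes it easy to review or filter out weak matches afterwards. Without the normalization step, names that differ only by a leading "The" can end up closer to the wrong school than to the right one, so it is worth inspecting the matches on real data before trusting them.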
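adist() computes the generalized Levenshtein (edit) distance between strings: the minimal number of insertions, deletions and substitutions needed to turn one string into the other. The smaller the distance, the more similar the strings.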
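To get a feel for what those distances look like, here is a small sketch using base R only. The variable names are just for illustration; adist() returns a matrix, so as.numeric() is used to pull out the single value.

```r
# One extra "u": a single deletion turns the first string into the second
typo <- as.numeric(adist("Syracuuse University", "Syracuse University"))

# Textbook example: two substitutions plus one insertion
classic <- as.numeric(adist("kitten", "sitting"))

# Case counts by default: every letter is a substitution
case_sensitive <- as.numeric(adist("SYRACUSE", "syracuse"))

# With ignore.case = TRUE the two strings are treated as identical
case_insensitive <- as.numeric(adist("SYRACUSE", "syracuse", ignore.case = TRUE))

c(typo = typo, classic = classic,
  case_sensitive = case_sensitive, case_insensitive = case_insensitive)
```

This is why closest_match() passes ignore.case = TRUE: differences in capitalization would otherwise count against a match just like real typos.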