R - return n matches via Levenshtein distance
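I want to find the n best matches for a given string by Levenshtein distance. I know the adist function in R gives the minimum distance, but I am trying to scale the number of results up to, say, 10. My code so far is below.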
name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
"811 quincy st washington dc", "1911 1st st rockville md")
source1 <- data.frame(name, address)
name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
"joes crag shack", "mike lowry place", "holiday inn", "zummer")
name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
"1100 21st st nw washington dc", "1804 w 5th st wilmington de",
"1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
"400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address)
dist.mat.nm <- adist(source1$name, source2$name, partial = T, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address.full, source2$address.full, partial = TRUE, ignore.case = TRUE)
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)
dist.mat2 <- ifelse(is.na(dist.mat.ad), dist.mat.nm, dist.mat.ad)
which.match <- function(x, nm) return(nm[which(x == min(x))[1]])
which.index <- function(x, nm) return(which(x == min(x))[1])
source2.matches.name <- apply(dist.mat, 1, which.match, nm = source2$name)
source2.name.index <- apply(dist.mat, 1, which.index, nm =
source2$names[source2.matches.name])
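The desired result is a data frame containing source1$name with columns for its 5 best matches by Levenshtein distance using adist, and likewise source1$address with its 5 best matches. Perhaps top_n from dplyr could be used? Please let me know if anything is unclear. Any help is greatly appreciated. Thanks.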
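If I understand the question correctly, the following does what you want.
First, I will rerun the line that creates dist.mat.ad, because your code has an error: it refers to a column named address.full, but the column is actually named address.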
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
Now for the results you want.
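# apply() returns one column per source1 entry; keep the indices of the
# 5 closest source2 names (smallest distance first).
imat <- apply(dist.mat.nm, 1, order)[1:5, ]
top.nm <- data.frame(name = source1$name)
# Row k of imat holds the k-th best match for every source1 entry, so tmp
# ends up with one row per source1 entry and one column per rank.
tmp <- apply(imat, 1, function(i) source2$name[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.nm <- cbind(top.nm, tmp)
# Same steps for the addresses.
imat <- apply(dist.mat.ad, 1, order)[1:5, ]
top.ad <- data.frame(address = source1$address)
tmp <- apply(imat, 1, function(i) source2$address[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.ad <- cbind(top.ad, tmp)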
The results are in top.nm and top.ad.
Finally, clean up.
rm(imat, tmp)
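Since the question asks for n matches in general (for example 10 rather than 5), the same idea could be wrapped in a small function. This is only a sketch in base R: the name top_matches and its arguments queries, candidates and n are made up for illustration and are not part of adist or any package. It assumes ties are broken by position in candidates, which is what order() does, and it caps n at the number of candidates.

```r
# Hypothetical helper (not part of base R or the answer above): for each
# string in `queries`, return the `n` entries of `candidates` with the
# smallest adist() distance. Extra arguments are passed on to adist().
top_matches <- function(queries, candidates, n = 5, ...) {
  n <- min(n, length(candidates))         # cannot return more than exist
  d <- adist(queries, candidates, ...)    # rows: queries, columns: candidates
  # For every query, the column indices of the n smallest distances.
  # rbind() keeps a matrix shape even when n is 1.
  idx <- do.call(rbind, lapply(seq_len(nrow(d)), function(r) {
    order(d[r, ])[seq_len(n)]
  }))
  # Look up the candidate strings; values fill column by column, as in idx.
  matches <- matrix(as.character(candidates)[as.vector(idx)], nrow = nrow(idx))
  colnames(matches) <- paste("top", seq_len(n), sep = ".")
  data.frame(query = queries, matches, stringsAsFactors = FALSE)
}

# Usage with the names from the question
name1 <- c("holiday inn", "geico", "zgf", "morton phillips")
name2 <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
           "joes crag shack", "mike lowry place", "holiday inn", "zummer")
top.nm <- top_matches(name1, name2, n = 5, partial = TRUE, ignore.case = TRUE)
```

Calling it again with the address columns gives the equivalent of top.ad, and changing n to another value changes the number of match columns without editing the body.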