Approximate Matching and Replacement in Text R

Question

I have a sentence and want to replace only part of the string with a number. When the match is exact, the gsub function works perfectly.

gsub('great thing', 5555, c('hey this is a great thing'))
gsub('good rabbit', 5555, c('hey this is a good rabbit in the field'))

But now I have the following problem. If part of the string contains an error, how can I apply a fuzzy matching function to the string?

gsub('great thing', 5555, c('hey this is a graet thing'))
gsub('good rabbit', 5555, c('hey this is a goood rabit in the field'))

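The algorithm should figure out that "great thing" and "graet thing", or "good rabbit" and "goood rabit", are very similar, and replace the approximate match with the number 5555. Ideally it would use the Jaro-Winkler distance to find the approximate match within the string and then replace that approximate substring. I need a fairly generic algorithm for this.

Any ideas?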

Answer 1

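Some agrep examples:

# Default tolerance (max.distance = 0.1); returns the indices of matching elements.
agrep("lasy", "1 lazy 2")
# Same search, but with substitutions disallowed.
agrep("lasy", "1 lazy 2", max.distance = list(sub = 0))
# Allow up to 2 edits; matching is case-sensitive by default.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2)
# value = TRUE returns the matching strings instead of their indices.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, value = TRUE)
# ignore.case = TRUE makes the comparison case-insensitive.
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, ignore.case = TRUE)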
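Note that agrep only tells you which elements match; it does not replace anything. One way to get the replacement the question asks for, as a rough sketch using only base R, is to slide a window of words across the sentence, score each window against the pattern with adist (generalized Levenshtein distance, not Jaro-Winkler), and splice the replacement over the closest window. The helper name fuzzy_replace and its max_dist argument are made up for this illustration, and the sketch assumes words are separated by single spaces.

```r
# Hypothetical helper (not part of base R): replace the run of words in `x`
# that is closest to `pattern`, provided its edit distance is <= max_dist.
fuzzy_replace <- function(pattern, replacement, x, max_dist = 2) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  n <- length(strsplit(pattern, " ", fixed = TRUE)[[1]])
  if (length(words) < n) return(x)

  # Every window of n consecutive words, pasted back into a phrase.
  starts <- seq_len(length(words) - n + 1)
  windows <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))

  # Levenshtein distance from the pattern to each window.
  d <- as.vector(adist(pattern, windows))
  best <- which.min(d)
  if (d[best] > max_dist) return(x)

  # Splice the replacement in place of the best-scoring window.
  paste(c(head(words, best - 1),
          as.character(replacement),
          tail(words, length(words) - (best + n - 1))),
        collapse = " ")
}

fuzzy_replace("great thing", 5555, "hey this is a graet thing")
# "hey this is a 5555"
fuzzy_replace("good rabbit", 5555, "hey this is a goood rabit in the field")
# "hey this is a 5555 in the field"
```

Because the window always has the same number of words as the pattern, this will miss typos that merge or split words ("greatthing", "gre at thing"). Plain Levenshtein also counts the swapped letters in "graet" as two edits, which is why max_dist defaults to 2 here.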

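agrep is in base R. If you load stringdist, you can compute string distances, Jaro-Winkler included, with (you guessed it) stringdist, or if you are lazy you can just use ain or amatch. For my purposes I tend to use Damerau-Levenshtein (method="dl") more, but your mileage may vary.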
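For illustration, here is a small sketch of those three stringdist functions applied to the strings from the question. It assumes the stringdist package is installed; the variable names (targets, d_dl and so on) are just for this example.

```r
library(stringdist)  # assumes the stringdist package is installed

targets <- c("great thing", "good rabbit")

# Pairwise distances. "dl" counts edits (a swap of adjacent letters is one edit);
# "jw" is Jaro-Winkler, scaled to the range [0, 1].
d_dl <- stringdist("graet thing", "great thing", method = "dl")
d_jw <- stringdist("graet thing", "great thing", method = "jw", p = 0.1)

# amatch() returns the index of the closest entry in `targets` whose
# distance is at most maxDist, or NA if there is none.
idx <- amatch("goood rabit", targets, method = "dl", maxDist = 2)

# ain() is the approximate counterpart of %in%.
found <- ain("graet thing", targets, method = "dl", maxDist = 2)

print(list(dl = d_dl, jw = d_jw, match = targets[idx], found = found))
```

These functions compare whole strings, so to locate a misspelled phrase inside a longer sentence you would still have to split the sentence into candidate substrings first and match those.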

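Be sure to read up on exactly how an algorithm's parameters work before you use it (i.e. set the p, q and maxDist values to levels that make sense for what you are doing).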
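To make the effect of those parameters a bit more concrete, here is a hedged sketch (again assuming stringdist is installed; the variable names are only illustrative) that changes one parameter at a time on the same pair of strings.

```r
library(stringdist)  # assumes the stringdist package is installed

a <- "graet thing"
b <- "great thing"

# p: Winkler prefix weight for method = "jw". p = 0 gives the plain Jaro
# distance; a positive p rewards a shared prefix ("gr" here).
jaro         <- stringdist(a, b, method = "jw", p = 0)
jaro_winkler <- stringdist(a, b, method = "jw", p = 0.1)

# q: size of the q-grams compared by "qgram", "cosine" and "jaccard".
# With q = 1 only letter counts matter, so a transposition is invisible.
unigrams <- stringdist(a, b, method = "qgram", q = 1)
bigrams  <- stringdist(a, b, method = "qgram", q = 2)

# maxDist: cutoff used by amatch() and ain(), on the scale of the chosen method.
strict  <- amatch(a, b, method = "dl", maxDist = 0)  # one edit is too many
lenient <- amatch(a, b, method = "dl", maxDist = 1)  # one edit is allowed

print(c(jaro = jaro, jaro_winkler = jaro_winkler,
        unigrams = unigrams, bigrams = bigrams))
print(c(strict = strict, lenient = lenient))
```

The main thing to watch is that maxDist has no universal meaning: for "jw" it is a fraction between 0 and 1, while for "dl" it is a count of edits, so a cutoff tuned for one method will behave very differently under another.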

Tags: text, fuzzy-search, r, matching