Jaro-Winkler distance differs between packages

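I am using fuzzy matching to clean up user-entered medication data, with the Jaro-Winkler distance as the metric. While testing which package computes the Jaro-Winkler distance faster, I noticed that the default settings do not give the same values. Can anyone help me understand where the difference comes from? Example: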

library(RecordLinkage)
library(stringdist)

jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667

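I assume it has to do with the weights, and I know I am using the defaults in both packages. If someone with more experience could shed light on what is going on, I would really appreciate it. Thanks!

Documentation: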

https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf

Tucked away in the documentation for stringdist is the following:

The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.

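However, the default in stringdist::stringdist is p = 0, which reduces the result to the plain Jaro distance. Setting p = 0.1 reproduces the RecordLinkage values: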

1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), 
               method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667

In fact, that value (0.1) is hard-coded in the source of RecordLinkage::jarowinkler.
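
To see where the gap comes from, the correction d − l · p · d can be applied by hand to the p = 0 results from the question. The following is only a base-R sketch, not either package's implementation: the helper `common_prefix_len` and the variable names are made up for illustration, and the Jaro distances are copied from the stringdist output above rather than recomputed.

```r
# Illustrative base-R sketch; helper and variable names are hypothetical.
reference  <- "advil"
candidates <- c("advi", "advill", "advil", "dvil", "sdvil")

# Jaro distances (p = 0), copied from the stringdist output in the question.
jaro_dist <- 1 - c(0.9333333, 0.9444444, 1.0000000, 0.9333333, 0.8666667)

# l: number of leading characters the two strings share, capped at 4.
common_prefix_len <- function(a, b, max_len = 4L) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), max_len)
  if (n == 0L) return(0L)
  same <- ca[seq_len(n)] == cb[seq_len(n)]
  # If all compared characters agree, l is n; otherwise stop at the first mismatch.
  if (all(same)) n else which(!same)[1] - 1L
}

l <- vapply(candidates, common_prefix_len, integer(1),
            a = reference, USE.NAMES = FALSE)

p <- 0.1
jw_dist <- jaro_dist - l * p * jaro_dist   # d - l * p * d
jw_sim  <- 1 - jw_dist

print(round(jw_sim, 4))
# [1] 0.9600 0.9667 1.0000 0.9333 0.8667
```

Only the first two candidates change, because they share a four-character prefix with "advil" (l = 4). "dvil" and "sdvil" mismatch on the first character (l = 0), so the correction term vanishes and both packages already agreed on them.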