Difference in Jaro-Winkler distance between packages
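I am using fuzzy matching to clean up user-entered medication data, based on the Jaro-Winkler distance. While testing which package computes the Jaro-Winkler distance faster, I noticed that the default settings do not give the same values. Can anyone help me understand where the difference comes from? Example: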
library(RecordLinkage)
library(stringdist)
jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
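I assume it has to do with the weights, and I know I am using the defaults in both. I would appreciate it if someone with more experience could shed light on what is going on. Thanks!
Documentation: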
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf
Tucked away in the documentation for stringdist is the following:
The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.
However, in stringdist::stringdist the default is p = 0. Hence:
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"),
method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
In fact, that value of p (0.1) is hard-coded in the source of RecordLinkage::jarowinkler.
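As a hand check of the formula quoted from the stringdist documentation, the corrected values can be reproduced from the p = 0 output in the question. This is only a sketch: the symbol d_jw is notation introduced here, p = 0.1 is assumed, and the Jaro distances are read off the question's output (1 - 0.9333333 = 1/15 for "advi" and "dvil", 1 - 0.9444444 = 1/18 for "advill"). Both "advi" and "advill" share at least four leading characters with "advil", so l = 4; "dvil" mismatches on the first character, so l = 0 and its value does not change.

```latex
\begin{align*}
d_{jw} &= d - l \cdot p \cdot d = d\,(1 - l\,p) \\
\text{advi:}\quad d_{jw} &= \tfrac{1}{15}\,(1 - 4 \cdot 0.1) = 0.04, & 1 - d_{jw} &= 0.96 \\
\text{advill:}\quad d_{jw} &= \tfrac{1}{18}\,(1 - 4 \cdot 0.1) \approx 0.0333, & 1 - d_{jw} &\approx 0.9667 \\
\text{dvil:}\quad d_{jw} &= \tfrac{1}{15}\,(1 - 0 \cdot 0.1) \approx 0.0667, & 1 - d_{jw} &\approx 0.9333
\end{align*}
```

These agree with the jarowinkler() output in the question, which is consistent with the two packages differing only in the default value of p.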