Is there any other way to find a similarity metric between records with low overhead and high accuracy (other than the Jaro-Winkler algorithm)?

Question

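I am trying to implement a similarity metric between strings using the Jaro-Winkler algorithm in Python. I am using an Anaconda environment, deployed on an Alibaba Cloud ECS instance.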

Sample code I use to find the similarity:

from pyjarowinkler import distance
print ("Average Score ---->", distance.get_jaro_distance("hello", "haloa"))

Average Score ---->0.76

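When I process 600,000 records, it takes more than 20 minutes, which is far too slow for a large number of records. Is there any other way to find the similarity metric between the records with low overhead and high accuracy?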

Answer 1

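The Jaro-Winkler distance is a similarity score between two strings. The Jaro measure is a weighted sum of the percentage of matched characters from each string and of transposed characters. Winkler increased this measure for strings whose initial characters match.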
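As a rough illustration of the description above, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler formulas. It is not the pyjarowinkler source; the function names `jaro` and `jaro_winkler` and the `max_prefix` parameter are chosen here for illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two characters only count as a match inside this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


print(round(jaro("hello", "haloa"), 2))          # 0.73
print(round(jaro_winkler("hello", "haloa"), 2))  # 0.76
```

For "hello" and "haloa", three characters match (h, l, o) with no transpositions, so the Jaro score is (3/5 + 3/5 + 3/3) / 3 ≈ 0.733, and the shared one-character prefix raises it to 0.76 with a scaling factor of 0.1.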

The original implementation is based on the Jaro-Winkler similarity algorithm article that can be found on Wikipedia. This Python version of the original implementation is based on the Apache StringUtils library.

Unit tests similar to those found in the StringUtils library were used to validate the implementation.

>>> from pyjarowinkler import distance
>>> # Scaling is 0.1 by default
>>> print(distance.get_jaro_distance("hello", "haloa", winkler=True, scaling=0.1))
0.76
>>> print(distance.get_jaro_distance("hello", "haloa", winkler=False, scaling=0.1))
0.733333333333

More details are available at this link.

I hope this helps with your query.

python

python-3.x

alibaba-cloud-ecs

alibaba-cloud