When to use which fuzz function to compare 2 strings

Question

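I am learning fuzzywuzzy in Python.

I understand the concepts behind fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio. My question is: when should I use which function?

Do I check the lengths of the 2 strings first and, if they are not similar, rule out fuzz.partial_ratio?
If the lengths of the 2 strings are similar, do I use fuzz.token_sort_ratio?
Should I always use fuzz.token_set_ratio?

Does anyone know what criteria SeatGeek uses?

I am trying to build a real estate website and am thinking of using fuzzywuzzy to compare addresses.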

Answer 1

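Great question.

I'm an engineer at SeatGeek, so I think I can help here. We have a great blog post that explains the differences quite well, but I can summarize and offer some insight into how we use the different types.

Overview

Under the hood, each of the four methods calculates the edit distance between some ordering of the tokens in both input strings. This is done using the difflib.ratio function (the ratio method of difflib.SequenceMatcher), which will: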

Return a measure of the sequences' similarity (float in [0,1]).

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1 if the sequences are identical, and 0 if they have nothing in common.

The four fuzzywuzzy methods call difflib.ratio on different combinations of the input strings.

fuzz.ratio

Simple. Just calls difflib.ratio on the two input strings (code).

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS")
> 96
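As a rough illustration of the 2.0*M / T formula, here is a minimal sketch that uses only the standard library. The helper name simple_ratio is made up for this example and is not part of fuzzywuzzy; the library's own score can differ slightly depending on the version and on whether python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return 2*M/T scaled to 0-100 and rounded (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# 13 matching characters out of 27 in total: 2 * 13 / 27 = 0.963
print(simple_ratio("NEW YORK METS", "NEW YORK MEATS"))  # 96
```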

fuzz.partial_ratio

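Attempts to account for partial string matches better. Calls ratio using the shortest string (length n) against all n-length substrings of the larger string and returns the highest score (code).

Notice here that "YANKEES" is the shortest string (length 7), and we run ratio with "YANKEES" against every substring of length 7 of "NEW YORK YANKEES" (which includes checking against "YANKEES" itself, a 100% match):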

fuzz.ratio("YANKEES", "NEW YORK YANKEES")
> 60
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")
> 100
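The sliding-window idea described above can be sketched in a few lines. This is a brute-force illustration under my own assumptions, not the library's code: the names ratio and partial_ratio_sketch are hypothetical, and fuzzywuzzy picks its candidate windows differently, so its scores may not be identical in every case.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Plain similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against every same-length window."""
    shorter, longer = sorted((a, b), key=len)
    n = len(shorter)
    if n == 0:
        return 0  # empty input is treated as "no match" in this sketch
    windows = (longer[i:i + n] for i in range(len(longer) - n + 1))
    return max(ratio(shorter, window) for window in windows)


print(partial_ratio_sketch("YANKEES", "NEW YORK YANKEES"))  # 100
```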

fuzz.token_sort_ratio

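Attempts to account for similar strings that are out of order. Calls ratio on both strings after sorting the tokens in each string (code). Notice that here fuzz.ratio and fuzz.partial_ratio both fail, but once you sort the tokens it's a 100% match: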

fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 100
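A minimal sketch of the sort-then-compare step, again with standard-library code only. The helper names are hypothetical, and the preprocessing here (lowercase, split on whitespace) is a simplifying assumption; the library's own preprocessing also handles punctuation.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Plain similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sort_tokens(s: str) -> str:
    """Lowercase, split on whitespace, and rejoin the tokens in sorted order."""
    return " ".join(sorted(s.lower().split()))


def token_sort_ratio_sketch(a: str, b: str) -> int:
    return ratio(sort_tokens(a), sort_tokens(b))


a = "New York Mets vs Atlanta Braves"
b = "Atlanta Braves vs New York Mets"
print(sort_tokens(a))                 # atlanta braves mets new vs york
print(token_sort_ratio_sketch(a, b))  # 100
```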

fuzz.token_set_ratio

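Attempts to rule out differences in the strings. Calls ratio on three particular substring sets and returns the max (code):

the intersection alone vs. the intersection plus the remainder of string one
the intersection alone vs. the intersection plus the remainder of string two
the intersection plus the remainder of one vs. the intersection plus the remainder of two

Notice that by splitting up the intersection and the remainders of the two strings, we're accounting for both how similar and how different the two strings are: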

fuzz.ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 36
fuzz.partial_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 61
fuzz.token_sort_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 51
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 91
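The three comparisons listed above can be sketched as follows. This is an illustration under simplifying assumptions (lowercase and whitespace tokenization only), and the function names are hypothetical rather than the library's implementation.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Plain similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a: str, b: str) -> int:
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0  # empty input is treated as "no match" in this sketch
    common = " ".join(sorted(tokens_a & tokens_b))
    rest_a = " ".join(sorted(tokens_a - tokens_b))
    rest_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + rest_a).strip()
    combined_b = (common + " " + rest_b).strip()
    return max(
        ratio(common, combined_a),      # intersection vs. intersection + rest of one
        ratio(common, combined_b),      # intersection vs. intersection + rest of two
        ratio(combined_a, combined_b),  # both combined strings
    )


print(token_set_ratio_sketch(
    "mariners vs angels",
    "los angeles angels of anaheim at seattle mariners",
))  # 91
```

In this example the winning comparison is "angels mariners" against "angels mariners vs", which is why the short string's extra tokens barely hurt the score.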

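Application

This is where the magic happens. At SeatGeek, we essentially create a vector of scores, one per ratio, for each data point (venue, event name, etc.) and use that to inform programmatic decisions about similarity that are specific to our problem domain.

That said, it doesn't sound like FuzzyWuzzy is a good fit for your use case. It will be very bad at determining whether two addresses are similar. Consider two possible addresses for SeatGeek HQ: "235 Park Ave Floor 12" and "235 Park Ave S. Floor 12":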

fuzz.ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 93
fuzz.partial_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 85
fuzz.token_sort_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 95
fuzz.token_set_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 100

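FuzzyWuzzy gives these strings a high match score, but one address is our actual office near Union Square and the other is on the other side of Grand Central.

For your problem, you would be better off using the Google Geocoding API.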

Answer 2

As of June 2017, fuzzywuzzy also includes some other comparison functions. Here is an overview of the ones missing from the accepted answer (taken from the source code):

fuzz.partial_token_sort_ratio

Same algorithm as in token_sort_ratio, but instead of applying ratio after sorting the tokens, it uses partial_ratio.

fuzz.token_sort_ratio("New York Mets vs Braves", "Atlanta Braves vs New York Mets")
> 85
fuzz.partial_token_sort_ratio("New York Mets vs Braves", "Atlanta Braves vs New York Mets")
> 100
fuzz.token_sort_ratio("React.js framework", "React.js")
> 62
fuzz.partial_token_sort_ratio("React.js framework", "React.js")
> 100
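To make the combination concrete, here is a small sketch that sorts the tokens and then applies a brute-force partial comparison. All helper names are hypothetical, the tokenization is simplified (lowercase, whitespace split), and the library's own window selection differs, so treat it as an approximation.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Plain similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against every same-length window."""
    shorter, longer = sorted((a, b), key=len)
    n = len(shorter)
    if n == 0:
        return 0  # empty input is treated as "no match" in this sketch
    return max(ratio(shorter, longer[i:i + n]) for i in range(len(longer) - n + 1))


def sort_tokens(s: str) -> str:
    return " ".join(sorted(s.lower().split()))


def partial_token_sort_ratio_sketch(a: str, b: str) -> int:
    return partial_ratio_sketch(sort_tokens(a), sort_tokens(b))


a, b = "New York Mets vs Braves", "Atlanta Braves vs New York Mets"
print(ratio(sort_tokens(a), sort_tokens(b)))   # 85, the extra token "atlanta" costs points
print(partial_token_sort_ratio_sketch(a, b))   # 100, the sorted short string is a window of the long one
```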

fuzz.partial_token_set_ratio

Same algorithm as in token_set_ratio, but instead of applying ratio to the sets of tokens, it uses partial_ratio.

fuzz.token_set_ratio("New York Mets vs Braves", "Atlanta vs New York Mets")
> 82
fuzz.partial_token_set_ratio("New York Mets vs Braves", "Atlanta vs New York Mets")
> 100
fuzz.token_set_ratio("React.js framework", "Reactjs")
> 40
fuzz.partial_token_set_ratio("React.js framework", "Reactjs")
> 71

fuzz.QRatio, fuzz.UQRatio

Just wrappers around fuzz.ratio with some validation and short-circuiting, included here for completeness. UQRatio is a unicode version of QRatio.

fuzz.WRatio

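Attempts to weight (the name stands for 'Weighted Ratio') the results from different algorithms to calculate the 'best' score. Description from the source code: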

1. Take the ratio of the two processed strings (fuzz.ratio)
2. Run checks to compare the length of the strings
    * If one of the strings is more than 1.5 times as long as the other
      use partial_ratio comparisons - scale partial results by 0.9
      (this makes sure only full results can return 100)
    * If one of the strings is over 8 times as long as the other
      instead scale by 0.6
3. Run the other ratio functions
    * if using partial ratio functions call partial_ratio,
      partial_token_sort_ratio and partial_token_set_ratio
      scale all of these by the ratio based on length
    * otherwise call token_sort_ratio and token_set_ratio
    * all token based comparisons are scaled by 0.95
      (on top of any partial scalars)
4. Take the highest value from these results
   round it and return it as an integer.
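The four steps above can be sketched roughly as follows. This is my own reading of the description, written with the standard library only; every function name is hypothetical, the preprocessing is simplified, and boundary handling (for example exactly 1.5 times as long) may differ from the real WRatio.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    return 100 * SequenceMatcher(None, a, b).ratio()


def partial(a, b):
    shorter, longer = sorted((a, b), key=len)
    n = len(shorter)
    if n == 0:
        return 0  # nothing to slide over the longer string
    return max(ratio(shorter, longer[i:i + n]) for i in range(len(longer) - n + 1))


def sort_tokens(s):
    return " ".join(sorted(s.split()))


def token_set_pairs(a, b):
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    combined_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    combined_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return [(common, combined_a), (common, combined_b), (combined_a, combined_b)]


def weighted_ratio_sketch(a, b):
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0
    base = ratio(a, b)                                        # step 1
    length_ratio = max(len(a), len(b)) / min(len(a), len(b))  # step 2
    token_scale = 0.95
    if length_ratio > 1.5:                                    # step 3, partial branch
        scale = 0.6 if length_ratio > 8 else 0.9
        candidates = [
            base,
            partial(a, b) * scale,
            partial(sort_tokens(a), sort_tokens(b)) * token_scale * scale,
            max(partial(x, y) for x, y in token_set_pairs(a, b)) * token_scale * scale,
        ]
    else:                                                     # step 3, full branch
        candidates = [
            base,
            ratio(sort_tokens(a), sort_tokens(b)) * token_scale,
            max(ratio(x, y) for x, y in token_set_pairs(a, b)) * token_scale,
        ]
    return round(max(candidates))                             # step 4


print(weighted_ratio_sketch("YANKEES", "NEW YORK YANKEES"))  # 90, a partial match is capped by the 0.9 scale
```

Because of the scaling, only a full-string match can reach 100 in this sketch: a perfect partial match tops out at 90, and a perfect token-based match on similar-length strings at 95.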

fuzz.UWRatio

Unicode version of WRatio.