Fuzzy string matching using Difflib get_matching_blocks not detecting all substrings

Question

I am trying to find all occurrences of a word in a paragraph, and I want the search to account for spelling mistakes as well. Code:

from difflib import SequenceMatcher

to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
# search_here has the word caterpillar repeated, but with spelling mistakes

s = SequenceMatcher(None, to_search, search_here).get_matching_blocks()
print(s)

#Output  : [Match(a=0, b=0, size=11), Match(a=3, b=69, size=0)] 
#Expected: [Match(a=0, b=0, size=11), Match(a=0, b=32, size=11), Match(a=0, b=81, size=11)]

Difflib's get_matching_blocks only detects the first instance of "caterpillar" in the search_here string. I want it to output all closely matching blocks, i.e. it should identify "caterpillar", "catterpillar" and "caterpilar".

How can I solve this?

Answer 1

You can compute the edit distance between each word and to_search. Then you can select all the words whose edit distance is "low enough" (a distance of 0 means an exact match).
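As a rough sketch of that idea, assuming the words are separated by whitespace and using a hand-written Levenshtein distance: the helper names `levenshtein` and `close_words` and the cutoff `max_distance=2` are illustrative choices, not part of any library.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def close_words(to_search, text, max_distance=2):
    """Return (word, distance) for every word within max_distance of to_search."""
    result = []
    for word in text.split():
        distance = levenshtein(to_search, word)
        if distance <= max_distance:  # hypothetical cutoff, tune for your data
            result.append((word, distance))
    return result


to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
print(close_words(to_search, search_here))
# [('caterpillar', 0), ('catterpillar', 1), ('caterpilar', 1)]
```

An absolute cutoff like this is easy to reason about for a single search word; for words of very different lengths a relative score may be a better fit.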

Thanks to your question, I discovered that there is a pip-installable edit_distance Python module. Here are a few examples from my first try with it:

>>> import edit_distance
>>> edit_distance.SequenceMatcher('fabulous', 'fibulous').ratio()
0.875
>>> edit_distance.SequenceMatcher('fabulous', 'wonderful').ratio()
0.11764705882352941
>>> edit_distance.SequenceMatcher('fabulous', 'fabulous').ratio()
1.0
>>> edit_distance.SequenceMatcher('fabulous', '').ratio()
0.0
>>> edit_distance.SequenceMatcher('caterpillar', 'caterpilar').ratio()
0.9523809523809523

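So the ratio method appears to give you a number between 0 and 1 (inclusive), where 1 is an exact match and 0 is... not even in the same league XD. So yes, you can select the words whose ratio is greater than 1 - epsilon, where epsilon might be around 0.1.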
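To tie this back to the question, here is a minimal sketch of the threshold approach that also reports where each match starts. It swaps in the standard library's difflib.SequenceMatcher for the edit_distance module (so it runs without installing anything), and its ratio is computed differently, so scores may not agree in general. The function name `fuzzy_find` and the default `epsilon=0.1` are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def fuzzy_find(to_search, text, epsilon=0.1):
    """Return (start_index, word, ratio) for words scoring at least 1 - epsilon."""
    hits = []
    # \S+ walks over whitespace-separated words and keeps their offsets
    for match in re.finditer(r"\S+", text):
        word = match.group()
        ratio = SequenceMatcher(None, to_search, word).ratio()
        if ratio >= 1 - epsilon:
            hits.append((match.start(), word, ratio))
    return hits


to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
for start, word, ratio in fuzzy_find(to_search, search_here):
    print(start, word, round(ratio, 3))
```

On the sample string this should pick out the three spellings at offsets 0, 31 and 80. Comparing word by word avoids the original problem: get_matching_blocks aligns the two whole sequences once, so each character of to_search can be matched at most one time.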
