Faster and more efficient Python method for fuzzy matching substrings

Question

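I want my program to use fuzzy matching to find every occurrence of words such as "crocodile"; that is, misspelled occurrences should be counted as well.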

import difflib

s = "Difference between a crocodile and an alligator is......."  # Long paragraph, >10000 words
to_search = ["crocodile", "insect", "alligator"]

for i in range(len(to_search)):
    for j in range(len(s)):
        # Window of the same length as the search word, starting at position j
        a = s[j:j+len(to_search[i])]
        match = difflib.SequenceMatcher(None, a, to_search[i]).ratio()
        if match > 0.9:  # more than 90% similarity
            print(a)

So all of the following should be treated as instances of "crocodile": "crocodile", "crocodil", "crocodele", and so on.
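
As a quick check of how the 0.9 threshold treats these spellings, the sketch below prints the difflib ratio for each one. The helper name `similarity` is hypothetical and only used for illustration. `SequenceMatcher.ratio()` is 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so "crocodele" (8 matched characters out of 9 + 9) scores 16/18, just under 0.9, and a slightly lower cutoff may be needed to catch it.

```python
import difflib

def similarity(candidate, target):
    """Return the difflib similarity ratio between two strings (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, candidate, target).ratio()

target = "crocodile"
for candidate in ["crocodile", "crocodil", "crocodele"]:
    ratio = similarity(candidate, target)
    # Report the score and whether it clears the strict 0.9 threshold.
    print(f"{candidate}: {ratio:.3f}, above 0.9: {ratio > 0.9}")
```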

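The method above works, but it is far too slow when the main string (here "s") is longer than 1 million words. Is there a faster way to do this than the method above**?

** (Splitting the string into chunks the size of the substring and then comparing each chunk with the reference word.)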

Answer 1

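One reason this takes so long on a large text is that you slide a window across the entire text repeatedly, once for every word you are searching for. Much of that computation compares your word against same-length chunks that may span parts of several words.

If you are willing to assume that you always want to match individual words, you can split the text into words and compare only against those words. That requires far fewer comparisons (one per word, rather than one per window starting at every position in the text), and the split only needs to be done once, not once per search term. Here is an example: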

import difflib

to_search = ["crocodile", "insect", "alligator"]
s = "Difference between a crocodile and an alligator is"  # Long paragraph, >10000 words
s_words = s.replace(".", " ").split()  # Remove periods, then split on whitespace (once)
for search_for in to_search:
    for s_word in s_words:
        match = difflib.SequenceMatcher(None, s_word, search_for).ratio()
        if match > 0.9:  # more than 90% similarity
            print(s_word)  # keep scanning so that every occurrence is reported
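
To get a rough feel for the saving, the sketch below counts how many `SequenceMatcher` calls each approach makes on the sample sentence. The function name `count_comparisons` is hypothetical, and it only counts calls; it does not measure actual running time.

```python
def count_comparisons(text, search_terms):
    """Return (sliding_window_calls, word_level_calls) for the two approaches."""
    # The sliding-window loop starts one window at every character position, per term.
    window_calls = len(text) * len(search_terms)
    # Word-level matching compares each whitespace-separated word once per term.
    word_calls = len(text.split()) * len(search_terms)
    return window_calls, word_calls

sample = "Difference between a crocodile and an alligator is"
terms = ["crocodile", "insect", "alligator"]
window_calls, word_calls = count_comparisons(sample, terms)
print(f"sliding window: {window_calls} calls, word level: {word_calls} calls")
```

The gap grows with the average word length, since the window count scales with characters while the word-level count scales with words.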

This should give you a significant speedup. I hope it solves your problem!
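
If the text repeats the same words many times, a further saving may be possible by comparing each distinct word only once and multiplying by how often it occurs. The following is a minimal sketch under some assumptions that are not in the original: the helper name `count_fuzzy_matches` is hypothetical, words are lowercased and tokenized with a simple `[a-z]+` regex, and the standard library's `difflib.get_close_matches` is used, whose `cutoff` is inclusive (`>=`) rather than the strict `>` used above.

```python
import difflib
import re
from collections import Counter

def count_fuzzy_matches(text, search_terms, cutoff=0.9):
    """Count the words in text that fuzzily match each search term.

    Hypothetical helper for illustration; tokenization is an assumption.
    """
    # Tokenize once and count how often each distinct word occurs.
    word_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vocabulary = list(word_counts)
    totals = {}
    for term in search_terms:
        # Compare the term against each distinct word only once.
        # n must be positive, so guard against an empty vocabulary.
        close = difflib.get_close_matches(
            term, vocabulary, n=max(1, len(vocabulary)), cutoff=cutoff
        )
        # Add up the occurrences of every distinct word that matched.
        totals[term] = sum(word_counts[word] for word in close)
    return totals

text = "A crocodile, a crocodil and a crocodile met an aligator."
print(count_fuzzy_matches(text, ["crocodile", "insect", "alligator"]))
```

How much this helps depends on how repetitive the text is: the number of comparisons per search term drops from the total word count to the number of distinct words.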

Happy coding!

Tags: python, string, difflib, python-3.x