How do I fuzzy match a word to a full word (and only a full word) in a sentence?

Question

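Most commonly misspelled English words are within two or three typographic errors (a combination of substitutions s, insertions i, or letter deletions d) of their correct form. For example, the errors in the word pair absence - absense can be summarized as 1 s, 0 i and 0 d.

One can fuzzy match to find words and their misspellings using the regex Python module (intended as a replacement for re).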

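The following table summarizes attempts to fuzzy segment a word of interest out of some sentence:

Regex1 finds the best word match in sentence, allowing at most 2 errors
Regex2 finds the best word match in sentence, allowing at most 2 errors while trying to operate only on (I think) whole words
Regex3 finds the best word match in sentence, allowing at most 2 errors while operating only on (I think) whole words. I'm wrong somehow.
Regex4 finds the best word match in sentence, allowing at most 2 errors while (I think) requiring the end of the match to be a word boundary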

How would I write a regular expression that eliminates false positive and false negative fuzzy matches on these word-sentence pairs?

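A possible solution would be to compare only the words in the sentence (strings of characters surrounded by white space or the beginning/end of a line) to the word of interest (the principal word). If there is a fuzzy match (e<=2) between the principal word and a word in the sentence, then return that full word (and only that word) from the sentence.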
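The idea above can be sketched without any regex at all. This is only an illustration under stated assumptions: it swaps the regex module's fuzzy matching for a plain Levenshtein distance, the helper names `edit_distance` and `fuzzy_whole_words` are hypothetical, and a multi-word principal word such as "cub cadet" is compared against windows of the same number of sentence tokens.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_whole_words(word: str, sentence: str, max_errors: int = 2) -> list:
    """Return whitespace-delimited candidates within max_errors edits of word.

    Hypothetical helper: the window size equals the number of
    whitespace-separated tokens in the principal word.
    """
    tokens = sentence.split()
    n = len(word.split())
    hits = []
    for i in range(len(tokens) - n + 1):
        candidate = " ".join(tokens[i:i + n])
        if edit_distance(word, candidate) <= max_errors:
            hits.append(candidate)
    return hits


pairs = [
    ("cub cadet", "cub cadet 42"),
    ("spt", "heat and air conditioner"),
    ("ryobi", "batteries kyobi"),
    ("trafficmaster", "traffic mast5er"),
]
for word, sentence in pairs:
    print(word, "->", fuzzy_whole_words(word, sentence))
```

One consequence of splitting on white space: a misspelling that inserts a space, such as "traffic mast5er" for "trafficmaster", is seen as two separate tokens and is not matched by this sketch.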

Code

Copy the following dataframe to your clipboard:

            word                  sentence
0      cub cadet              cub cadet 42
1        plastex              vinyl panels
2            spt  heat and air conditioner
3     closetmaid                closetmaid
4          ryobi           batteries kyobi
5          ryobi       10' table saw ryobi
6  trafficmaster           traffic mast5er

Now use

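import pandas as pd, regex
# Columns are separated by two or more spaces
df = pd.read_clipboard(sep=r'\s\s+')

test = df  # note: an alias of df, not a copy
# Regex1: best match, at most 2 errors
test[r'(?b)(?:WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
# Regex2: attempt to restrict the match to whole words with \w ... \W
test[r'(?b)(?:\wWORD\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:\w' + x['word'] + r'\W){e<=2}', x['sentence']), axis=1)
# Regex3: attempt using a V1 set intersection with \w
test[r'(?V1)(?b)(?:\w&&WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:\w&&' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
# Regex4: attempt to require a non-word character at the end of the match
test[r'(?V1)(?b)(?:WORD&&\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:' + x['word'] + r'&&\W){e<=2}', x['sentence']), axis=1)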

to load the table into your environment.

Answer 1

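Use '(?b)\m(?:WORD){e<=2}\M'. In the regex module, \m matches at the start of a word and \M matches at the end of a word, so the fuzzy match has to begin at a word start and finish at a word end.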
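A minimal sketch of how that pattern might be applied to a few of the word-sentence pairs from the question. The helper name `fuzzy_full_word` is hypothetical, and passing the word through `regex.escape` is an added precaution that is not part of the answer; the third-party regex module must be installed.

```python
import regex


def fuzzy_full_word(word: str, sentence: str, max_errors: int = 2) -> list:
    """Find fuzzy matches of word that start and end on word boundaries.

    Hypothetical helper built around the answer's pattern:
    (?b) requests the best match, \\m and \\M anchor the match to the
    start and end of a word, and {e<=N} allows at most N errors.
    """
    pattern = (r'(?b)\m(?:' + regex.escape(word) + r'){e<='
               + str(max_errors) + r'}\M')
    return regex.findall(pattern, sentence)


pairs = [
    ("closetmaid", "closetmaid"),
    ("ryobi", "batteries kyobi"),
    ("ryobi", "10' table saw ryobi"),
    ("spt", "heat and air conditioner"),
]
for word, sentence in pairs:
    print(word, "->", fuzzy_full_word(word, sentence))
```

To use it on the dataframe from the question, the same helper could be passed to `df.apply(lambda x: fuzzy_full_word(x['word'], x['sentence']), axis=1)`.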


python

regex

fuzzy-search