Match a list of strings against a block of text

Beginner here:

I have a block of text:

For example: 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'

And a list of words: ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']

My end goal is to find the strings from the list of words that match, exactly or fuzzily, somewhere in the block of text.

What I have tried: difflib.get_close_matches
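For reference, a minimal sketch of that attempt, assuming the whole block was passed as the query string (the variable name `whole_text_matches` is only illustrative). `get_close_matches` scores each candidate against the entire query, so a short phrase compared with a much longer block cannot reach the default cutoff of 0.6:

```python
from difflib import get_close_matches

text = ('hey this is a block of text, for an example, wow looks cool '
        'blah blah blah angiotensin enzyme looks cool okay.But what about '
        'angiotensin enzym well I dont know.')
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']

# Whole-string comparison: the length difference alone keeps every
# similarity ratio far below the default cutoff (0.6).
whole_text_matches = get_close_matches(text, words)
print(whole_text_matches)
```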

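Required output: 'angiotensin enzyme serum', 'angiotensin enzyme a1'

The order of the output does not matter.

For other blocks of text, other strings from the list would match. The block is not constant.

Is there a way to achieve this?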

Using fuzzywuzzy (from PyPI):

from fuzzywuzzy import fuzz

text = 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'

words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']

matches = [w for w in words if fuzz.partial_ratio(text, w) > 70.]

You will obviously need to tune the threshold for your data, but in this example the scores are well separated:

>>> print(matches)
['angiotensin enzyme serum', 'angiotensin enzyme a1']

>>> for w in words:
...     print(w, fuzz.partial_ratio(text, w))
... 
angiotensin enzyme serum 83
some diff enzyme 56
angiotensin enzyme a1 90
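
If installing a third-party package is not an option, the same idea can be approximated with the standard library alone. The following is a rough sketch, not fuzzywuzzy's actual algorithm: it slides a window the length of each phrase across the text and keeps the best `difflib.SequenceMatcher` ratio. The helper names `best_window_ratio` and `fuzzy_matches`, the lower-casing, and the default threshold of 70 are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def best_window_ratio(text, phrase):
    """Best similarity (0-100) between `phrase` and any window of `text`
    that has the same number of characters as `phrase`."""
    text, phrase = text.lower(), phrase.lower()
    n = len(phrase)
    if len(text) <= n:
        # Text is no longer than the phrase: compare them directly.
        return 100 * SequenceMatcher(None, text, phrase, autojunk=False).ratio()
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(phrase)  # seq2 is preprocessed once and reused
    best = 0.0
    for start in range(len(text) - n + 1):
        matcher.set_seq1(text[start:start + n])
        best = max(best, matcher.ratio())
    return 100 * best

def fuzzy_matches(text, phrases, threshold=70):
    """Return the phrases whose best window score exceeds `threshold`."""
    return [p for p in phrases if best_window_ratio(text, p) > threshold]

text = ('hey this is a block of text, for an example, wow looks cool '
        'blah blah blah angiotensin enzyme looks cool okay.But what about '
        'angiotensin enzym well I dont know.')
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']

matches = fuzzy_matches(text, words)
print(matches)
```

On the sample data this should select the same two strings as the fuzzywuzzy version, although the individual scores can differ because every window is examined. The scan costs roughly one comparison per character of text for each phrase, so it suits short blocks better than large documents.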