Match a list of strings against a block of text
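Beginner here.
I have a block of text, for example:
'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
and a list of phrases: ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']
My end goal is to find the strings in the list that match the block of text, either exactly or fuzzily.
What I have tried: difflib.get_close_matches
Desired output: 'angiotensin enzyme serum', 'angiotensin enzyme a1'
The order of the output does not matter.
For other blocks of text, different strings from the list would match; the block is not constant.
Is there a way to achieve this?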
Use fuzzywuzzy (from PyPI):
from fuzzywuzzy import fuzz
text = 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']
# partial_ratio scores each phrase (0-100) against its best-matching part of the text
matches = [w for w in words if fuzz.partial_ratio(text, w) > 70]
Obviously you will need to tune the threshold to suit your data, but in this example the scores are well separated:
>>> print(matches)
['angiotensin enzyme serum', 'angiotensin enzyme a1']
>>> for w in words:
...     print(w, fuzz.partial_ratio(text, w))
...
angiotensin enzyme serum 83
some diff enzyme 56
angiotensin enzyme a1 90
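If installing a third-party package is not an option, the same idea can be approximated with difflib from the standard library. This is only a rough sketch, not fuzzywuzzy's actual implementation: it slides a window the length of each phrase across the text and keeps the best SequenceMatcher ratio. The function names best_window_score and fuzzy_matches are made up for this example, and the threshold of 70 is simply carried over from above, so treat both as assumptions to adjust.

```python
from difflib import SequenceMatcher

text = ('hey this is a block of text, for an example, wow looks cool blah blah blah '
        'angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.')
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']


def best_window_score(text, phrase):
    """Best similarity (0-100) between phrase and any phrase-length slice of text."""
    n = len(phrase)
    if n == 0:
        return 0
    # If the text is shorter than the phrase, compare against the whole text once.
    starts = range(max(1, len(text) - n + 1))
    best = max(
        SequenceMatcher(None, text[i:i + n], phrase, autojunk=False).ratio()
        for i in starts
    )
    return round(100 * best)


def fuzzy_matches(text, phrases, threshold=70):
    """Return the phrases whose best window score exceeds the threshold."""
    return [p for p in phrases if best_window_score(text, p) > threshold]


if __name__ == '__main__':
    for w in words:
        print(w, best_window_score(text, w))
    print(fuzzy_matches(text, words))
```

The scores need not be identical to fuzzywuzzy's, and the sliding window does more work than necessary on long texts, so for anything beyond small inputs the library version is probably the better choice.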