Extract words in a paragraph that are similar to words in a list
I have the following string:
"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
And a list of words to extract:
["town","teddy","chicken","boy went"]
Note: town and teddy are misspelled in the given sentence.
I tried the following, but I get extra words that are not part of the answer:
import difflib
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town","teddy","chicken","boy went"]
[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]
I get the following result:
[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]
Instead of:
'twn', 'tddy', 'chicken','boy went'
From the documentation for difflib.get_close_matches():
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
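At the moment, you are using the default values for n and cutoff.
You can specify either one (or both) to narrow down the matches that are returned.
For example, you could use a cutoff score of 0.75: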
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
Or you could specify that at most 1 match should be returned:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]
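In either case, you can use a list comprehension to flatten the list of lists (since difflib.get_close_matches() always returns a list). The "if r" guard skips targets that produced no match: with cutoff=0.75, "boy went" matches no single word, so r[0] on its own would raise an IndexError:
matches = [r[0] for r in result if r]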
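To see why a stricter cutoff removes 'to' and 'picked', it can help to look at the similarity scores themselves. The following is a minimal sketch, assuming that get_close_matches() ranks candidates by difflib.SequenceMatcher.ratio(), as the CPython implementation does; the variable names are only illustrative:

```python
import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

target_scores = {}
for target in ["town", "chicken"]:
    scores = {}
    for candidate in sent.split():
        # Same argument order as get_close_matches(): candidate first, target second.
        score = difflib.SequenceMatcher(None, candidate, target).ratio()
        if score >= 0.6:  # 0.6 is the default cutoff
            scores[candidate] = round(score, 3)
    target_scores[target] = scores

for target, scores in target_scores.items():
    print(target, scores)
# town {'twn': 0.857, 'to': 0.667}
# chicken {'chicken.': 0.933, 'picked': 0.615}
```

Both unwanted words score below 0.75 while the intended ones score well above it, which is why cutoff=0.75 is enough for the single-word targets in this sentence. Other inputs may need a different threshold.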
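Since you also want to check for close matches of bigrams, you can do that by extracting pairings of adjacent "words" and passing them as part of the possibilities argument to difflib.get_close_matches().
Here is a complete working example: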
import difflib
import re
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]
# this extracts overlapping pairings of "words"
# i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)
# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]
matches = [r[0] for r in result]
print(matches)
# ['twn', 'tddy', 'chicken.', 'boy went']
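The result above still contains 'chicken.' with its trailing period, because sent.split() keeps punctuation attached to the word. If the goal is to get 'chicken' exactly as in the question, one option is to strip punctuation from each token before matching. This is a sketch of that variation, not part of the original answer; it builds the bigrams with zip() instead of a regex, and it assumes that pairs spanning a sentence boundary (such as "chicken He") are acceptable candidates:

```python
import difflib
import string

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

# Strip leading/trailing punctuation so "chicken." becomes "chicken",
# and drop any token that was punctuation only.
words = [w.strip(string.punctuation) for w in sent.split()]
words = [w for w in words if w]

# Adjacent word pairs (bigrams), e.g. "The boy", "boy went", "went to", ...
pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
candidates = words + pairs

matches = []
for target in list1:
    found = difflib.get_close_matches(target.lower().strip(), candidates, n=1)
    if found:  # get_close_matches() returns an empty list when nothing passes the cutoff
        matches.append(found[0])

print(matches)
# ['twn', 'tddy', 'chicken', 'boy went']
```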
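If you read the Python documentation for difflib.get_close_matches()
https://docs.python.org/3/library/difflib.html
you will see that it returns a list of the best matches, not just one.
The method signature is:
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Here n is the maximum number of close matches to return, so I think you can pass it as 1. Note that this only compares single words, so "boy went" comes back as 'went':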
>>> [difflib.get_close_matches(x.lower().strip(), sent.split(),1)[0] for x in list1]
['twn', 'tddy', 'chicken.', 'went']