Python: Check if a sentence contains any word from a list (with fuzzy matching)
I want to extract keywords from a sentence, given a list_of_keywords.
I managed to extract the exact matches with:
[word for word in Sentence.split() if word in set(list_of_keywords)]
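Is it possible to also extract words that are sufficiently similar to an entry in list_of_keywords, i.e. where the cosine similarity between the two words is > 0.8?
For example, the keyword in the given list is 'allergy', but the sentence is written as
'a severe allergic reaction to nuts in the meal she had consumed.'
The cosine similarity between 'allergy' and 'allergic' can be computed as follows: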
cosdis(word2vec('allergy'), word2vec('allergic'))
Out[861]: 0.8432740427115677
How can I extract 'allergic' from the sentence based on cosine similarity?
sentence = 'a severe allergic reaction to nuts in the meal she had consumed.'
list_of_keywords = ['allergy', 'reaction']
word_list = []
# compare every keyword against every word of the sentence
for keyword in list_of_keywords:
    for word in sentence.split():
        if cosdis(word2vec(keyword), word2vec(word)) > 0.8:
            word_list.append(word)
Or, if you only want to extract words based on the keyword 'allergy':
[word for word in sentence.split() if cosdis(word2vec('allergy'), word2vec(word)) > 0.8]
from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in word
    cw = Counter(word)
    # precomputes a set of the different characters
    sw = set(cw)
    # precomputes the "length" of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))
    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine distance we have
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]

list_of_keywords = ['allergy', 'something']
Sentence = 'a severe allergic reaction to nuts in the meal she had consumed.'
threshold = 0.80
for key in list_of_keywords:
    for word in Sentence.split():
        try:
            res = cosdis(word2vec(word), word2vec(key))
            if res > threshold:
                print("Found a word with cosine distance > 80 : {} with original word: {}".format(word, key))
        except IndexError:
            pass
Output:
Found a word with cosine distance > 80 : allergic with original word: allergy
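The reported match can be checked by hand. This is a worked sketch that assumes each word is represented by its character counts, as word2vec above does, and writes those count vectors as u and v (symbols introduced here for illustration only):

```latex
\[
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\sum_{ch} u_{ch}\, v_{ch}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
\]
% 'allergy':  a=1, l=2, e=1, r=1, g=1, y=1       so  ||u|| = sqrt(9)  = 3
% 'allergic': a=1, l=2, e=1, r=1, g=1, i=1, c=1  so  ||v|| = sqrt(10)
% shared characters: a, l, e, r, g
\[
\cos(\mathbf{u}, \mathbf{v})
  = \frac{1 + 4 + 1 + 1 + 1}{3\sqrt{10}}
  = \frac{8}{3\sqrt{10}}
  \approx 0.8433
\]
```

This agrees with the value 0.8432740427115677 quoted in the question, and it also shows that the measure only compares letter counts, so anagrams of a keyword would score 1.0.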
Edit:
One-liner:
print([x for x in Sentence.split() for y in list_of_keywords if cosdis(word2vec(x), word2vec(y)) > 0.8])
Output:
['allergic']
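Each word's distance must be checked against all of the keywords, and the word should be included only if at least one keyword reaches the threshold. I added an extra condition to the original list comprehension: a nested comprehension that does exactly this.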
def distance(word, keyword):
    return cosdis(word2vec(word), word2vec(keyword))

threshold = 0.8
keywords = set(list_of_keywords)
# keep exact matches, plus any word close enough to at least one keyword
matches = [word for word in Sentence.split() if word in keywords or
           any(distance(word, keyword) > threshold for keyword in keywords)]
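
Putting the pieces above together, here is a minimal self-contained sketch of the "match against any keyword" idea. The names char_vector, cosine_similarity and fuzzy_matches are hypothetical helpers introduced for this illustration, not part of the answers above. Lowercasing and stripping surrounding punctuation (so that 'consumed.' is compared as 'consumed') are added assumptions as well.

```python
from collections import Counter
from math import sqrt
import string


def char_vector(word):
    """Return (character counts, Euclidean norm) for a word."""
    counts = Counter(word)
    norm = sqrt(sum(c * c for c in counts.values()))
    return counts, norm


def cosine_similarity(v1, v2):
    """Cosine similarity of two character-count vectors."""
    (c1, n1), (c2, n2) = v1, v2
    if n1 == 0 or n2 == 0:
        return 0.0  # an empty word is similar to nothing
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    return dot / (n1 * n2)


def fuzzy_matches(sentence, keywords, threshold=0.8):
    """Return the words of sentence that are close to at least one keyword."""
    # Vectorise the keywords once instead of once per comparison.
    keyword_vectors = [char_vector(k.lower()) for k in keywords]
    matches = []
    for token in sentence.split():
        # Assumption: surrounding punctuation and case should not matter.
        word = token.strip(string.punctuation).lower()
        if not word:
            continue
        vec = char_vector(word)
        # any() stops at the first keyword that passes, so a word is
        # added at most once even if several keywords are close to it.
        if any(cosine_similarity(vec, kv) > threshold for kv in keyword_vectors):
            matches.append(word)
    return matches


sentence = 'a severe allergic reaction to nuts in the meal she had consumed.'
print(fuzzy_matches(sentence, ['allergy', 'something']))  # ['allergic']
```

Unlike the one-liner, which emits a word once for every keyword it matches, this version reports each word of the sentence at most once per occurrence.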