Calculate similarity between lists of words
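I want to calculate the similarity between two lists of words. For example, this list:
['email','user','this','email','address','customer']
is similar to this list:
['email','mail','address','netmail']
and I want that pair to get a higher similarity percentage than the first list gets with another list, for example:
['address','ip','network']
even though address appears in that list.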
Since you haven't shown a clear expected output, here is my best shot:
list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
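For the two lists above, we compute the cosine similarity between every element of one list and every element of the other, e.g. email from list_B against each element of list_A: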
from collections import Counter
from math import sqrt

def word2vec(word):
    # character-count vector of the word (not the Word2Vec embedding model)
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))
    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity we have
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
threshold = 0.80 # if needed

for key in list_A:
    for word in list_B:
        res = cosdis(word2vec(word), word2vec(key))
        print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
        # if res > threshold:
        #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))
Output:
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
Note: I have also commented out the threshold part of the code, in case you only want the words whose similarity exceeds a certain threshold, e.g. 80%.
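To make the threshold idea concrete, here is a minimal sketch that keeps only the word pairs whose character-count cosine similarity exceeds a cutoff. The helper names char_cosine and pairs_above are hypothetical (they are not part of the code above), and collapsing duplicates in the first list with set() is an assumption about what you want.

```python
from collections import Counter
from math import sqrt

def char_cosine(w1, w2):
    """Cosine similarity between the character-count vectors of two words."""
    c1, c2 = Counter(w1), Counter(w2)
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    # Guard (an addition): an empty word has no direction, so report 0.0
    return dot / norm if norm else 0.0

def pairs_above(list_a, list_b, threshold=0.80):
    """Return (word_from_a, word_from_b, score) for pairs scoring above threshold."""
    hits = []
    for key in sorted(set(list_a)):  # assumption: duplicates in list_a are ignored
        for word in list_b:
            score = char_cosine(key, word)
            if score > threshold:
                hits.append((key, word, score))
    return hits

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

for key, word, score in pairs_above(list_A, list_B):
    print("{} ~ {}: {:.1f}".format(key, word, score * 100))
```

Keep in mind that this measure reflects shared letters, not meaning: in the output above, user and address score about 60% even though the words are unrelated.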
EDIT:
OP: But what I really want to do is compare list by list, not word by word.
Using Counter and math:
from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)
Output:
53.03300858899106
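As a sketch of how this relates to the original question, the same Counter-based measure can be applied to both candidate lists. list_C is the third list from the question; the function name list_cosine and the zero-length guard are additions for illustration only.

```python
from collections import Counter
import math

def list_cosine(words_a, words_b):
    """Cosine similarity between the word-count vectors of two lists."""
    c1, c2 = Counter(words_a), Counter(words_b)
    # Only words present in both lists contribute to the dot product
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    mag = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    # Guard (an addition): an empty list has no direction, so report 0.0
    return dot / mag if mag else 0.0

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

score_B = list_cosine(list_A, list_B)
score_C = list_cosine(list_A, list_C)
print("A vs B: {:.1f}%".format(score_B * 100))  # about 53.0%
print("A vs C: {:.1f}%".format(score_C * 100))  # about 20.4%
```

With exact word matching, list_B scores higher than list_C because it shares two words with list_A (one of them repeated there), while list_C shares only address.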
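You can leverage the power of Scikit-Learn (or another NLP library) for this task. The example below uses CountVectorizer, but for more sophisticated document analysis it is better to use TfidfVectorizer instead.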
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vect_cos(vect, test_list):
    """ Vectorise text and compute the cosine similarity """
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (1.0+)
    query_0 = vect.transform([' '.join(vect.get_feature_names_out())])
    query_1 = vect.transform(test_list)
    # cosine_similarity accepts sparse matrices directly, so no dense conversion is needed
    cos_sim = cosine_similarity(query_0, query_1)
    return query_1, np.round(cos_sim.squeeze(), 3)

# Train the vectorizer
vocab=['email','user','this','email','address','customer']
vectoriser = CountVectorizer().fit(vocab)
vectoriser.vocabulary_ # show the word-matrix position pairs (when run interactively)

# Analyse list_1
list_1 = ['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])

# Analyse list_2
list_2 = ['address','ip','network']
list_2_vect, list_2_cos = vect_cos(vectoriser, [' '.join(list_2)])

print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
print('\nThe cosine similarity for the second list is {}.'.format(list_2_cos))
Output:
The cosine similarity for the first list is 0.632.
The cosine similarity for the second list is 0.447.
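The vect_cos helper compares each list against a vector containing every vocabulary word once, so repeated words in the reference list do not count. A possible variation, sketched here as an assumption rather than as this answer's method, is to vectorise all three lists together and compare the rows directly; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = ['email', 'user', 'this', 'email', 'address', 'customer']
list_1 = ['email', 'mail', 'address', 'netmail']
list_2 = ['address', 'ip', 'network']

# One "document" per list: CountVectorizer expects strings, so join the words
docs = [' '.join(words) for words in (reference, list_1, list_2)]
X = CountVectorizer().fit_transform(docs)  # rows: reference, list_1, list_2

# Compare the reference row against the two candidate rows
scores = cosine_similarity(X[0:1], X[1:]).ravel()
print("list_1: {:.3f}".format(scores[0]))
print("list_2: {:.3f}".format(scores[1]))
```

Because the rows hold plain word counts, these scores should agree with a Counter-based cosine over the same lists. Swapping CountVectorizer for TfidfVectorizer in this sketch would down-weight words that occur in every list, such as address.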
EDIT
If you want to compute the cosine similarity between "email" and any other list of strings, train the vectorizer on "email" and then analyse the other documents.
# Train the vectorizer
vocab=['email']
vectoriser = CountVectorizer().fit(vocab)
# Analyse list_1
list_1 =['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
Output:
The cosine similarity for the first list is 1.0.
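The result of 1.0 follows from the vocabulary holding a single word: every list is projected onto one dimension, so the cosine is 1.0 whenever email is present, regardless of the other words. Below is a dependency-free sketch of that projection. It mirrors what vect_cos computes under the assumption that each word is a single token; the name vocab_cosine is hypothetical.

```python
from collections import Counter
from math import sqrt

def vocab_cosine(vocab, words):
    """Cosine between an all-ones vocabulary vector and the counts of
    `words` restricted to that vocabulary."""
    terms = sorted(set(vocab))
    counts = Counter(w for w in words if w in terms)
    dot = sum(counts[t] for t in terms)  # the vocabulary vector is all ones
    norm = sqrt(len(terms)) * sqrt(sum(c * c for c in counts.values()))
    # Guard (an addition): no overlap with the vocabulary gives 0.0
    return dot / norm if norm else 0.0

list_1 = ['email', 'mail', 'address', 'netmail']
list_2 = ['address', 'ip', 'network']
full_vocab = ['email', 'user', 'this', 'email', 'address', 'customer']

print(vocab_cosine(['email'], list_1))             # 1.0
print(vocab_cosine(['email'], list_2))             # 0.0
print(round(vocab_cosine(full_vocab, list_1), 3))  # 0.632
print(round(vocab_cosine(full_vocab, list_2), 3))  # 0.447
```

A one-word vocabulary therefore only tells you whether that word occurs in the list, so a larger vocabulary is needed for a graded score.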