Is there any solution to get a similarity score between lists of words?
I want to compute the similarity between lists of words, for example:
import math
from collections import Counter

test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee', 'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

def counter_cosine_similarity(c1, c2):
    # Cosine similarity between two term-count vectors.
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

counter1 = Counter(test)
counter2 = Counter(list_a)
counter3 = Counter(list_b)

score = counter_cosine_similarity(counter1, counter2)
print(score)  # output: 0.4472135954999579
score = counter_cosine_similarity(counter1, counter3)
print(score)  # output: 0.4999999999999999
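For me, these are not the scores I want. They should be the other way around: list_a contains both address and ip, so it matches test 100%. I know that cosine similarity compares test and list_a in both directions, so the score is low because list_a has elements that are not in test. What I want is to compare test against list_a in one direction only, not both.
Expected output: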
score = counter_cosine_similarity(counter1, counter2)
print(score)  # expected: higher than the list_b score, maybe 1.0
score = counter_cosine_similarity(counter1, counter3)
print(score)  # expected: lower than the list_a score, maybe 0.5
If you want a higher value the more terms the two lists have in common, use this code:
score = len(set(test).intersection(set(list_x)))
This tells you how many terms the two lists have in common. If you want repeated terms to score higher, try:
commonTerms = set(test).intersection(set(list_x))
counter = Counter(list_x)
score = sum(counter.get(term) for term in commonTerms)
If you need the score scaled to [0..1], I would need to know more about your dataset.
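As one possible way to get a one-directional score in [0..1] without knowing more about the data, the overlap count can be divided by the number of distinct terms in test. This is only a sketch, not a definitive answer: the name containment_score is made up for illustration, and it assumes duplicates in either list should be ignored.

```python
def containment_score(query, candidate):
    """Fraction of distinct query terms that also occur in candidate.

    Hypothetical helper: it only asks how much of `query` is covered,
    so extra terms in `candidate` do not lower the score.
    """
    query_terms = set(query)
    if not query_terms:
        return 0.0  # assumption: an empty query matches nothing
    return len(query_terms & set(candidate)) / len(query_terms)


test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee',
          'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

print(containment_score(test, list_a))  # 1.0
print(containment_score(test, list_b))  # 0.5
```

With the lists from the question this gives the ordering asked for: list_a covers both terms of test, list_b covers only one of the two.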