Calculate cosine similarity between words
Suppose we have two lists of strings:
A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(",")
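The words in list A form my word-vector basis. How can I calculate a cosine similarity score for each word in B?
What I have done so far: I can calculate the cosine similarity of two whole lists with the following function: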
def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
But how do I incorporate my vector basis, and how do I then calculate the similarity between the terms in B?
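One literal reading of the question can be sketched as follows: treat the cleaned words of A as the basis (a Counter over A's vocabulary) and compare each word of B, wrapped in a one-element Counter, against it with the function above. The `clean` helper and the `basis` and `scores` names are assumptions made for illustration, not part of the original post. Under this reading a word of B scores above zero only if it occurs in A.

```python
import math
from collections import Counter


def counter_cosine_similarity(c1, c2):
    # Same function as in the question, repeated so the sketch is self-contained
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (magA * magB)


def clean(token):
    # Hypothetical helper: trim whitespace and edge punctuation, then lowercase
    return token.strip().strip(".,?!").lower()


A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(",")

# The basis: word counts over the vocabulary of A
basis = Counter(clean(w) for w in A)

# Score each word of B as a one-word Counter against the basis
scores = {clean(w): counter_cosine_similarity(basis, Counter([clean(w)])) for w in B}

for word, score in scores.items():
    print(word, round(score, 4))
```

Because every word in A occurs once, "weather" scores 1/sqrt(15) and every other word in B scores 0, which shows why a word-level basis says little about similarity between distinct words.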
import math
from collections import Counter

ListA = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
ListB = "bank, weather, sun, moon, fun, hi".split(",")


def cosdis(v1, v2):
    # Cosine similarity of two (counts, character set, length) triples
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]


def word2vec(word):
    # Count the characters in the word
    cw = Counter(word)
    # Set of distinct characters
    sw = set(cw)
    # Euclidean length of the character-count vector
    lw = math.sqrt(sum(c * c for c in cw.values()))
    return cw, sw, lw


def removePunctuations(str_input):
    ret = []
    # Raw string, so the backslash is not read as an invalid escape sequence
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for char in str_input:
        if char not in punctuations:
            ret.append(char)
    return "".join(ret)


for i in ListA:
    for j in ListB:
        # strip() drops the leading space that split(",") leaves on each word of ListB
        print(cosdis(word2vec(removePunctuations(i)), word2vec(removePunctuations(j.strip()))))
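The code above compares words as bags of characters, not by meaning. A minimal sketch of the same idea that also reports, for each word in B, its closest word in A might look like this. The names `char_cosine`, `normalize`, and `best_matches` are hypothetical, and lowercasing the tokens is an added assumption.

```python
import math
import string
from collections import Counter


def char_cosine(word_a, word_b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(word_a), Counter(word_b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(n * n for n in ca.values()))
    norm_b = math.sqrt(sum(n * n for n in cb.values()))
    # Guard against empty words, which have a zero-length vector
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalize(token):
    # Assumption: lowercase, then drop ASCII punctuation and whitespace
    return "".join(
        ch for ch in token.lower()
        if ch not in string.punctuation and not ch.isspace()
    )


def best_matches(list_a, list_b):
    """Map each normalized word of list_b to (score, closest word of list_a)."""
    result = {}
    for b in list_b:
        nb = normalize(b)
        result[nb] = max((char_cosine(normalize(a), nb), a) for a in list_a)
    return result


ListA = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
ListB = "bank, weather, sun, moon, fun, hi".split(",")

for word, (score, match) in best_matches(ListA, ListB).items():
    print(f"{word}: closest is {match!r} with score {score:.3f}")
```

Identical words score 1.0 and words that share no characters score 0.0. Anagrams also score 1.0, which is a known limitation of character-count vectors.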