python glove similarity measure calculation

I am trying to understand how python-glove computes its most-similar terms.

Does it use cosine similarity?

The example is from the python-glove GitHub repository, https://github.com/maciejkula/glove-python/tree/master/glove :

I know from gensim's word2vec that the most_similar method computes similarity using the cosine measure.

The GloVe project website explains this fairly clearly: http://www-nlp.stanford.edu/projects/glove/

In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.

For more detail on the math behind this, see the "Model overview" section of the website.

The project website is a bit unclear on this point:

The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.

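Euclidean distance is not the same as cosine similarity. It sounds like either one works well, but the website does not specify which one is used.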
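
To make the difference concrete, here is a minimal sketch with made-up 2-D vectors (the vectors and helper names are illustrative assumptions, not taken from GloVe). It shows a case where the two measures disagree about which vector is "closest" to a query:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b; ignores vector length.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Straight-line distance; sensitive to vector length.
    return float(np.linalg.norm(a - b))

# Toy vectors, invented for illustration only.
query = np.array([1.0, 1.0])
same_direction = np.array([5.0, 5.0])  # same direction as query, much longer
nearby = np.array([1.0, 0.0])          # close in space, different direction

cos_same = cosine_similarity(query, same_direction)
cos_near = cosine_similarity(query, nearby)
euc_same = euclidean_distance(query, same_direction)
euc_near = euclidean_distance(query, nearby)

# Cosine similarity prefers same_direction; Euclidean distance prefers nearby.
print(cos_same > cos_near, euc_near < euc_same)
```

If all vectors are first scaled to unit length, the two measures produce the same ranking, which may be why the website mentions them together.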

However, we can look at the source of the repo you are looking at to check:

dst = (np.dot(self.word_vectors, word_vec)
       / np.linalg.norm(self.word_vectors, axis=1)
       / np.linalg.norm(word_vec))

It uses cosine similarity.
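
As a rough sketch of how that expression can be turned into a ranking, the snippet below applies it to a tiny made-up vocabulary. The function name `most_similar_sketch`, the word list, and the vectors are hypothetical and only for illustration; this is not the library's actual method:

```python
import numpy as np

def most_similar_sketch(word_vectors, words, query, number=3):
    """Rank words by cosine similarity to `query` (hypothetical helper)."""
    idx = words.index(query)
    word_vec = word_vectors[idx]
    # Same expression as in the quoted source: dot products divided by
    # the norm of every row and by the norm of the query vector.
    dst = (np.dot(word_vectors, word_vec)
           / np.linalg.norm(word_vectors, axis=1)
           / np.linalg.norm(word_vec))
    # Sort from most to least similar and drop the query word itself.
    order = np.argsort(-dst)
    return [(words[i], float(dst[i])) for i in order if i != idx][:number]

# Toy data, invented for illustration only.
words = ["king", "queen", "apple"]
word_vectors = np.array([[1.0, 0.9, 0.1],
                         [0.9, 1.0, 0.1],
                         [0.1, 0.0, 1.0]])

result = most_similar_sketch(word_vectors, words, "king", number=2)
print(result)
```

Dividing by both norms is what makes this cosine similarity rather than a plain dot product, so long vectors do not dominate the ranking.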

Yes, it uses cosine similarity.

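The paper mentions this in the text: ...a similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity....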
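
One possible reading of that sentence, sketched below, is to scale each dimension (column) of the vector matrix to unit L2 norm across all words and then take the cosine similarity. The paper does not spell out the exact normalization, so the column-wise L2 scaling, the helper names, and the toy data here are all assumptions:

```python
import numpy as np

def normalize_features(word_vectors):
    # Scale each column (feature) to unit L2 norm across the vocabulary.
    # Assumption: L2 scaling is only one interpretation of "normalizing".
    norms = np.linalg.norm(word_vectors, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero features
    return word_vectors / norms

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (rows are words, columns are features), invented for illustration.
vectors = np.array([[3.0, 0.1],
                    [4.0, 0.2],
                    [0.0, 0.2]])

normalized = normalize_features(vectors)
score = cosine_similarity(normalized[0], normalized[1])
print(score)
```

The effect of the first step is that a feature with a large numeric range, such as the first column here, no longer dominates the cosine score.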