Cosine similarities and totally different results using same source

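I am learning about word embeddings and cosine similarity. My data consists of two sets of the same words, in two different languages.

I ran two tests:

I measured the cosine similarity using the average of the word vectors (I think this should be called soft cosine similarity)
I measured the cosine similarity using the word vectors themselves

Should I expect to get exactly the same results? I noticed that sometimes I get two opposite results. Since I'm new to this, I'm trying to figure out whether I'm doing something wrong or whether there is an explanation behind it. From what I have been reading, soft cosine similarity should be more accurate than the usual cosine similarity.

Now, time to show you some data. Unfortunately, I can't post part of my data (the words themselves), but I'll do my best to give you as much information as I can.

Some other details first:

I'm using FastText to create the embeddings, skipgram model with default parameters.
For soft cosine similarity, I'm using the SciPy spatial distance cosine. Following some people's suggestions, to get the cosine similarity it seems I should subtract the distance from 1, e.g.: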

(1-distance.cosine(data['LANG1_AVG'].iloc[i],data['LANG2_AVG'].iloc[i]))

For the usual cosine similarity, I'm using the Fast Vector cosine similarity from FastText Multilingual, defined this way:

    @classmethod
    def cosine_similarity(cls, vec_a, vec_b):
        """Compute cosine similarity between vec_a and vec_b"""
        return np.dot(vec_a, vec_b) / \
            (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
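For what it's worth, the two expressions quoted above appear to be the same formula written two ways: SciPy's `distance.cosine` returns 1 minus the cosine similarity, so subtracting it from 1 gives back the similarity. A minimal sketch to check this, assuming NumPy and SciPy are installed; the vectors are made-up placeholders, not real embeddings, and the standalone `cosine_similarity` function is a plain-function copy of the method above:

```python
import numpy as np
from scipy.spatial import distance


def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the Euclidean norms."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


# Hypothetical 4-dimensional vectors standing in for two word embeddings.
vec_a = np.array([0.2, -0.5, 0.1, 0.7])
vec_b = np.array([0.3, -0.4, 0.0, 0.9])

via_numpy = cosine_similarity(vec_a, vec_b)
via_scipy = 1 - distance.cosine(vec_a, vec_b)

# Both routes compute the same quantity, so they agree up to rounding.
same_formula = bool(np.isclose(via_numpy, via_scipy))
print(via_numpy, via_scipy, same_formula)
```

If that holds, any disagreement between the two tests would have to come from the inputs (averaged versus non-averaged vectors) rather than from the similarity function.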

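As you will see from the image here, for some words I get the same or very similar results with the two methods. For other words, I get two totally different results. How can I explain this?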

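As I understand it, the soft similarity between two vectors x and y is given by (avg(x) * avg(y)) / (abs(avg(x)) * abs(avg(y))) = sign(avg(x) * avg(y)), which is either 1 or -1, depending on whether the averages have the same sign. That is probably not very helpful.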
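To illustrate the point about averages, here is a small sketch under the assumption that each vector is reduced to a single number (its mean) before the similarity is computed. The vectors `x` and `y` are invented for the example:

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the Euclidean norms."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


# Hypothetical vectors; only their averages enter the first two scores.
x = np.array([0.9, -0.1, 0.4])
y = np.array([-0.3, 0.8, 0.1])

# Reducing each vector to one number leaves a 1-dimensional "vector".
avg_x = np.atleast_1d(x.mean())
avg_y = np.atleast_1d(y.mean())

score_of_averages = cosine_similarity(avg_x, avg_y)  # averages share a sign
score_flipped = cosine_similarity(avg_x, -avg_y)     # averages differ in sign
score_full = cosine_similarity(x, y)                 # uses every component

print(score_of_averages, score_flipped, score_full)
```

With these numbers the averaged score is 1 while the full-vector score is negative, which is the kind of "opposite results" situation described in the question.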

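Cosine similarity is computed as (x * y) / (||x|| * ||y||). Two vectors pointing in the same direction have a similarity of 1 (x * x = ||x||^2), two vectors pointing in opposite directions have a similarity of -1 (x * -x = -||x||^2), and two perpendicular vectors have a similarity of 0 ((1,0) * (0,1) = 0). If the angle between the vectors is not one of 0, 90, 180 or 270 degrees, the similarity score lies between (but is not equal to) -1 and 1.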
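The cases above can be checked numerically. A minimal sketch with 2-dimensional toy vectors (chosen only so the angles are easy to see):

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the Euclidean norms."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


x = np.array([1.0, 0.0])

same_direction = cosine_similarity(x, x)                    # angle 0
opposite = cosine_similarity(x, -x)                         # angle 180
perpendicular = cosine_similarity(x, np.array([0.0, 1.0]))  # angle 90
diagonal = cosine_similarity(x, np.array([1.0, 1.0]))       # angle 45

# Rescaling a vector changes its length but not its direction,
# so the score is the same as for the unscaled diagonal vector.
rescaled = cosine_similarity(x, 10.0 * np.array([1.0, 1.0]))

print(same_direction, opposite, perpendicular, diagonal, rescaled)
```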

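Bottom line: forget about the averages and just use cosine similarity. Note that cosine similarity compares the directions of the vectors, not their lengths.

PS: The French translation of "able" is "capable", not "able" ;)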

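After some additional research, I found a 2014 paper (Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model) that explains when and how to use the average of the features, and also explains exactly what the soft cosine measure is: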

Our idea is more general: we propose to modify the manner of calculation of similarity in Vector Space Model taking into account similarity of features. If we apply this idea to the cosine measure, then the “soft cosine measure” is introduced, as opposed to traditional “hard cosine”, which ignores similarity of features. Note that when we consider similarity of each pair of features, it is equivalent to introducing new features in the VSM. Essentially, we have a matrix of similarity between pairs of features and all these features represent new dimensions in the VSM.
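As I read the paper, the measure it describes weights every pair of features by a similarity matrix, i.e. soft_cos(a, b) = (a' S b) / (sqrt(a' S a) * sqrt(b' S b)), which falls back to the ordinary cosine when S is the identity matrix. A rough sketch of that reading; the function name `soft_cosine`, the feature counts and the similarity matrix are all made up for illustration:

```python
import numpy as np


def soft_cosine(a, b, s):
    """Soft cosine: (a^T S b) / (sqrt(a^T S a) * sqrt(b^T S b))."""
    return (a @ s @ b) / (np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b))


# Hypothetical feature counts over three features; a and b share no feature.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

# Hypothetical feature-similarity matrix: features 0 and 1 are related.
s = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

hard = soft_cosine(a, b, np.eye(3))  # identity matrix: the ordinary cosine
soft = soft_cosine(a, b, s)          # gives credit for the related features

print(hard, soft)
```

So the soft cosine measure is about similarity between features, which is a different thing from taking the cosine of averaged vectors.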
