How to compute similarities between arrays?

Question

I am trying to compute the similarity between two samples. The Python functions sklearn.metrics.pairwise.cosine_similarity and scipy.spatial.distance.cosine return results that I am not satisfied with. For example:

In the following case I expect 0.0%, because the two samples have no elements in common.

 tt1 = [1, 16, 4, 21]
 tt2 = [5, 17, 3, 22]

 from scipy import spatial
 res = 1-spatial.distance.cosine(tt1, tt2)
 print(res)
 0.9893593529663931

Here I expect a similarity of 0.25 (25%), because only one element, the first one (1), is the same in both arrays.

 tt1 = [1, 16, 4, 21]
 tt2 = [1, 17, 3, 22]

 from scipy import spatial
 res = 1-spatial.distance.cosine(tt1, tt2)
 print(res)
 0.9990578001169402

Here I expect 0.75 (75%), because three elements are the same (1, 16 and 4).

 tt1 = [1, 16, 4, 21]
 tt2 = [1, 16, 4, 22]
 res = 0.9997474232272052

Is there a way to get these expected results in Python?

Answer 1

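Those vectors are geometrically very close. Cosine similarity does not only measure whether the elements are identical; it also reflects how much they differ.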

It seems you just want an element-wise match rate?

sum([t1 == t2 for t1, t2 in zip(tt1, tt2)]) / len(tt1)
# or
np.equal(tt1, tt2).mean()
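As a minimal sketch, the match rate above can be wrapped in a small function and run against the three cases from the question. The name `match_rate` and the shape check are illustrative assumptions, not part of NumPy.

```python
import numpy as np

def match_rate(a, b):
    """Fraction of positions at which a and b hold the same value."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        # Position-wise comparison only makes sense for equal shapes.
        raise ValueError("arrays must have the same shape")
    return float(np.mean(a == b))

ref = [1, 16, 4, 21]
print(match_rate(ref, [5, 17, 3, 22]))  # 0.0
print(match_rate(ref, [1, 17, 3, 22]))  # 0.25
print(match_rate(ref, [1, 16, 4, 22]))  # 0.75
```

This compares values position by position, so reordering one of the arrays changes the result.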

Answer 2

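I think you have misunderstood what the function computes. From your description, you want to compute a misclassification error or accuracy. The function, however, takes two samples u and v and computes the cosine distance between them. In your first example: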

tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]

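Here u=tt1 and v=tt2. The individual values in the two arrays are the coordinates of these samples in a vector space (here a 4-dimensional space), not separate samples. See the function documentation, and in particular the example at the bottom.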
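To make the geometric reading concrete, here is a small sketch that computes the cosine similarity directly from its definition, the dot product divided by the product of the two norms. The helper name `cosine_sim` is hypothetical; `scipy.spatial.distance.cosine` returns one minus this quantity.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between u and v, treated as points in a vector space."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]

# The two vectors point in almost the same direction, so the value is
# close to 1 (about 0.9894) even though no coordinate is identical.
print(cosine_sim(tt1, tt2))

# Orthogonal vectors are what cosine similarity treats as "nothing in common".
print(cosine_sim([1, 0], [0, 1]))  # 0.0
```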

If each coordinate in these arrays represents a different sample, then:

If the order does not matter:

 len(np.intersect1d(np.array(tt1), np.array(tt2))) / len(tt1)
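The following sketch shows how this order-insensitive variant behaves. The function name `overlap_rate` is an illustrative assumption. Note that `np.intersect1d` returns the sorted unique values common to both inputs, so repeated values are counted only once.

```python
import numpy as np

def overlap_rate(a, b):
    """Number of distinct shared values, divided by the length of a."""
    common = np.intersect1d(a, b)  # sorted, unique values present in both
    return len(common) / len(a)

# Same values as the third example in the question, but shuffled:
# a position-wise comparison would find no matches, the set overlap finds 3.
print(overlap_rate([1, 16, 4, 21], [16, 1, 22, 4]))  # 0.75

# Duplicates collapse: only the single value 1 is shared, so this is 1/3.
print(overlap_rate([1, 1, 2], [1, 3]))
```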

Answer 3

You can use numpy.intersect1d as described in the documentation.

Here is an example of how I use it, with example 4#:

import numpy as np 
tt1 = [1, 16, 4, 21]
tt2 = [1, 16, 4, 22]
res = len(np.intersect1d(tt1, tt2)) / ((len(tt1)+len(tt2))/2)
print(res)
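For arrays of equal length this gives the same value as dividing by len(tt1). Dividing by the mean length only makes a difference when the two arrays have different lengths, as the sketch below shows. The function name `mean_length_overlap` is hypothetical.

```python
import numpy as np

def mean_length_overlap(a, b):
    """Shared distinct values divided by the mean length of the two arrays."""
    common = np.intersect1d(a, b)
    return len(common) / ((len(a) + len(b)) / 2)

# Equal lengths: 3 shared values / 4
print(mean_length_overlap([1, 16, 4, 21], [1, 16, 4, 22]))  # 0.75

# Different lengths: 2 shared values / mean length 3, i.e. 2/3
print(mean_length_overlap([1, 16, 4, 21], [1, 16]))
```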
