Text (cosine) similarity

Question

I followed Fred Foo's explanation in this Stack Overflow question: How to compute the similarity between two text documents?

I ran the following code that he wrote:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())

The result is:

[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]

But I noticed that when I set the corpus to:

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away"]

and run the same code, I get the matrix:

[[1.         0.19431434]
 [0.19431434 1.        ]]

So the similarity between the same two sentences changes (in the first matrix it is 0.17668795). Why is that? I'm really confused. Thanks in advance!

Answer 1

Wikipedia shows how tf-idf is calculated. The idf part of the weight depends on

N - the number of documents in the corpus.

So the similarity depends on the total number of documents/sentences in the corpus.
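To make the dependence on N concrete, here is the textbook definition next to the smoothed variant that scikit-learn uses by default (smooth_idf=True). The symbols t (term), d (document), D (corpus) and df(t) (number of documents containing t) are notation chosen for this sketch, not taken from the question.

```latex
% Textbook definition
\mathrm{idf}(t, D) = \log \frac{N}{|\{ d \in D : t \in d \}|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

% scikit-learn default (smooth_idf=True)
\mathrm{idf}(t) = \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

Both N and df(t) are properties of the whole corpus, so the weights of a single document change whenever other documents are added or removed.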

If you have more documents/sentences, the result changes.
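The effect can be checked by hand. The following is a minimal pure-Python sketch, assuming scikit-learn's default settings (smoothed idf, L2-normalized rows) and a tiny hand-picked stop list that stands in for the library's much longer "english" list. The helper names tokenize, tfidf_vectors and cosine are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stand-in for scikit-learn's much longer "english" stop list.
STOP_WORDS = {"a", "an", "and", "are", "never", "the", "to"}


def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(corpus):
    docs = [Counter(tokenize(text)) for text in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(u, v):
    # Rows are already L2-normalized, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

small = tfidf_vectors(corpus[:2])
large = tfidf_vectors(corpus)
sim_small = cosine(small[0], small[1])
sim_large = cosine(large[0], large[1])
print(round(sim_small, 4))  # 0.1943
print(round(sim_large, 4))  # 0.1767
```

In the two-sentence corpus "apple" occurs in every document, so its idf is 1 and the other words are weighted about 1.41. In the five-sentence corpus "apple" rises to about 1.41 while the words unique to one sentence rise to about 2.10, so the shared word carries a smaller share of each vector and the cosine drops.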

If you add the same document/sentence several times, that also changes the result.
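The duplicate case follows from the same idf term. As a small sketch, assuming scikit-learn's default smoothed idf (the function name smooth_idf is hypothetical), compare the weight of "like" in the two-sentence corpus with its weight after "I'd like an apple" is added a second time:

```python
import math


def smooth_idf(n_docs, doc_freq):
    # Assumed formula: scikit-learn's default, ln((1 + N) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# "like" occurs only in "I'd like an apple".
idf_like_before = smooth_idf(n_docs=2, doc_freq=1)
# After adding the same sentence again: 3 documents, 2 of them contain "like".
idf_like_after = smooth_idf(n_docs=3, doc_freq=2)

# "apple" occurs in every document in both cases, so its idf stays at 1.
idf_apple_before = smooth_idf(n_docs=2, doc_freq=2)
idf_apple_after = smooth_idf(n_docs=3, doc_freq=3)

print(round(idf_like_before, 4), round(idf_like_after, 4))
print(idf_apple_before, idf_apple_after)
```

The weight of "like" falls relative to "apple", so the normalized vectors shift and the pairwise similarities shift with them, even though no new words were introduced.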
