Spacy, Strange similarity between two sentences

Question

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

This returns a very strange value:

0.9066019751888448

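These two sentences should not be 90% similar; their meanings are very different.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?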

Answer 1

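The Spacy documentation for vector similarity explains the basic idea:
Each word has a vector representation, learned from the contexts in which it appears (Word2Vec-style embeddings). These vectors are trained on corpora, as explained in the documentation.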

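Now, the embedding of a full sentence is simply the average over all of its word vectors. If a lot of the words lie in the same semantic region (for example filler words like "he", "was", "this", ...) and the remaining vocabulary "cancels out", you can end up with a similarity like the one in your case.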
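To see the averaging effect in isolation, here is a toy sketch in plain Python. The 2-D vectors are invented for illustration and are not real spaCy vectors, and `mean_vector` and `cosine` are helper names made up for this example.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors (assumption: invented numbers, not taken from any model).
filler = [1.0, 0.0]    # stands in for words like "he", "was", "this"
topic_a = [0.0, 1.0]   # the one content word of sentence A
topic_b = [0.0, -1.0]  # the one content word of sentence B, opposite direction

sentence_a = [filler, filler, filler, topic_a]
sentence_b = [filler, filler, filler, topic_b]

content_sim = cosine(topic_a, topic_b)
sentence_sim = cosine(mean_vector(sentence_a), mean_vector(sentence_b))
print(content_sim, sentence_sim)
```

In this toy setup the two content words point in opposite directions, yet the averaged sentence vectors still come out with a cosine of about 0.8, because the three shared filler vectors dominate both averages.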

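The question, rightly, is what you can do about it. From my perspective, you could come up with a more complex similarity measure. Since search_doc and main_doc carry additional information, such as the original sentence, you could modify the vectors with a length-difference penalty, or alternatively compare shorter pieces of the sentences and compute pairwise similarities (then again, the question is which parts to compare).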
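As one possible reading of the "pairwise similarities" idea, here is a minimal sketch that matches each word vector in one sentence to its best counterpart in the other and averages those scores. The function name `best_match_similarity` and the choice of best-match averaging are assumptions for illustration, not something spaCy provides.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match_similarity(vectors_a, vectors_b):
    """Hypothetical measure: average best pairwise match, in both directions."""
    if not vectors_a or not vectors_b:
        return 0.0

    def directed(xs, ys):
        # For every vector in xs, keep only its closest match in ys.
        return sum(max(cosine(x, y) for y in ys) for x in xs) / len(xs)

    return 0.5 * (directed(vectors_a, vectors_b) + directed(vectors_b, vectors_a))

# Toy 2-D vectors, invented for illustration (not real word vectors).
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.0, -1.0]]
score = best_match_similarity(a, b)
print(score)
```

With spaCy, the inputs could be built with something like `[t.vector for t in doc if t.has_vector and not t.is_stop]`. Unlike a plain average, a word with no good counterpart in the other sentence pulls the score down instead of being washed out.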

Sadly, there is currently no clean way to simply resolve this issue.

Answer 2

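Spacy constructs sentence embeddings by averaging the word embeddings. Since an ordinary sentence contains a lot of words that carry little meaning (called stop words), you get poor results. You can remove them like this: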

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

Or keep only the nouns, since they carry the most information:

doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
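A variant of the same idea that avoids re-parsing a joined string is to average the vectors of the kept tokens directly. This is only a sketch: `filtered_mean_vector` and `cosine` are hypothetical helpers, and the code assumes nothing more than tokens exposing a `.vector` attribute, as spaCy tokens do.

```python
import math

def filtered_mean_vector(tokens, keep):
    """Average the vectors of the tokens for which keep(token) is true."""
    kept = [[float(x) for x in t.vector] for t in tokens if keep(t)]
    if not kept:
        return []  # nothing survived the filter
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is empty or all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With the docs from above, one could define `keep = lambda t: t.has_vector and not t.is_stop and not t.is_punct` and then call `cosine(filtered_mean_vector(search_doc, keep), filtered_mean_vector(main_doc, keep))`. Swapping the filter for `lambda t: t.pos_ in ('NOUN', 'PROPN')` gives the nouns-only version.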

Answer 3

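As noted by @dennlinger, Spacy's sentence embeddings are just the average of the individual word vectors. So if your sentence contains words with opposite meanings, like "good" and "bad", their vectors might cancel each other out, resulting in poor sentence embeddings. If your use case is specifically about getting sentence embeddings, you should try the SOTA approaches below.

Google's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/2
Facebook's InferSent encoder: https://github.com/facebookresearch/InferSent

I have tried both of these embeddings. They give good results most of the time, and both use word embeddings as the basis for building sentence embeddings.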

Cheers!

Answer 4

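As others have noted, you may want to use Universal Sentence Encoder or Infersent.

For Universal Sentence Encoder, you can install pre-built SpaCy models that handle the wrapping of TFHub, so you only need to install the package with pip and the vectors and similarity will work as expected.

You can follow the instructions in this repository (I am the author): https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub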

Install the model: pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.3/en_use_md-0.4.3.tar.gz#en_use_md-0.4.3
Load and use the model:

import spacy
# this loads the wrapper
nlp = spacy.load('en_use_md')

# your sentences
search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))
# this will print 0.310783598221594


python

nlp

spacy