Use spaCy to find the most similar sentences in a doc

Question

I'm looking for a way to do what Gensim's most_similar() does, but with spaCy. I want to use NLP to find the most similar sentences in a list of sentences.

I tried calling spaCy's similarity() in a loop (e.g. https://spacy.io/api/doc#similarity), but it takes a very long time.

To go further:

I would like to put all of these sentences in a graph (like this) to find clusters of sentences.

Any ideas?

Answer 1

Here is a simple built-in solution you can use:

import spacy

nlp = spacy.load("en_core_web_lg")
text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)
doc = nlp(text)
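# Materialize the sentences once so they can be indexed and sliced.
sentences = list(doc.sents)
# Start below any possible score so a pair is always selected.
max_similarity = float("-inf")
most_similar = None, None
# Compare each unordered pair of sentences exactly once.
for i, sent in enumerate(sentences):
    for other in sentences[i + 1:]:
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other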
print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")

(Text from Wikipedia)

It produces the following output:

Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
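
Since the question mentions that looping over similarity() is slow, here is a minimal sketch of a vectorized alternative. It assumes the sentence vectors are collected into one NumPy array and that cosine similarity is the measure you want (spaCy documents its default similarity() estimate as cosine similarity over averaged word vectors). The function name most_similar_pair and the toy vectors are hypothetical and only for illustration.

```python
import numpy as np


def most_similar_pair(vectors):
    """Return (i, j, score) for the two rows with the highest cosine similarity.

    `vectors` is an (n, d) array-like with one row per sentence, n >= 2.
    Hypothetical helper, not part of spaCy.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = vectors / norms
    sims = unit @ unit.T  # (n, n) matrix of cosine similarities
    # Keep only pairs with i < j by masking the diagonal and lower triangle.
    upper = np.triu(np.ones(sims.shape, dtype=bool), k=1)
    masked = np.where(upper, sims, -np.inf)
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return int(i), int(j), float(sims[i, j])


# With spaCy, the rows could come from the sentence vectors, for example:
#   sentences = list(doc.sents)
#   i, j, score = most_similar_pair([s.vector for s in sentences])
# Toy vectors (made up) so the sketch runs without a spaCy model:
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(most_similar_pair(toy_vectors))
```

This still compares every pair, but the work happens in a single matrix product rather than in one Python-level similarity() call per pair, which is usually much faster for long sentence lists.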

Note the following information from spacy.io:

To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:
- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg

See also the related advice on how to improve the similarity scores.
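
For the clustering part of the question, one simple option is to treat the sentences as nodes of a graph, connect two sentences when their similarity is at or above a threshold, and take the connected components as clusters. The sketch below assumes you already have a square similarity matrix; the function name cluster_by_threshold, the threshold value, and the toy matrix are all hypothetical.

```python
import numpy as np


def cluster_by_threshold(sims, threshold=0.9):
    """Group indices into clusters.

    Clusters are the connected components of the graph whose edges are the
    pairs (i, j) with sims[i][j] >= threshold. Hypothetical helper.
    """
    sims = np.asarray(sims, dtype=float)
    unvisited = set(range(sims.shape[0]))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, component = [start], [start]
        # Depth-first search over sentences that are similar enough.
        while stack:
            node = stack.pop()
            neighbours = [k for k in unvisited if sims[node, k] >= threshold]
            for k in neighbours:
                unvisited.remove(k)
                stack.append(k)
                component.append(k)
        clusters.append(sorted(component))
    return clusters


# Toy similarity matrix (made up): sentences 0 and 1 are close, 2 is not.
toy_sims = [
    [1.0, 0.95, 0.2],
    [0.95, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(cluster_by_threshold(toy_sims, threshold=0.9))
```

The threshold controls how tight the clusters are and would need tuning on real data; a dedicated clustering algorithm may work better for large or noisy sentence sets.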
