Is there a simple way to tell SpaCy to ignore stop words when using .similarity method?

So right now I have a very simple program that takes a sentence, finds the sentence in a given book that is most semantically similar to it, and then prints out that sentence along with the next few sentences.

import spacy
nlp = spacy.load('en_core_web_lg')

#load alice in wonderland
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)

sentences = list(alice.sents)

mysent = nlp("example sentence, could be whatever")

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print(sentences[sentences.index(best_match):sentences.index(best_match)+10])

I want to get better results out of this by telling SpaCy to ignore the stop words when it does this, but I don't know the best way to go about that. Like maybe I could create a new blank list and append every word that isn't a stop word to it:

newlist = []
for sentence in sentences:
    for word in sentence:
        if not word.is_stop:
            newlist.append(word)

But I would have to make it more complicated than the code above, because I would have to keep the original list of sentences intact (the indexes need to stay the same if I want to print out the full sentences that follow later). Plus, if I did it this way, I would have to run this new list of lists back through SpaCy in order to use the .similarity method.

I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating that list of lists so that the indexes stay identical to the original "sentences" variable.
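Something like this is the sort of parallel list I have in mind (just a rough, untested sketch; the filtered_docs name and the extra nlp() re-run are hypothetical):

filtered_docs = []
for sentence in sentences:
    kept = " ".join(word.text for word in sentence if not word.is_stop)
    # Re-run the filtered text through spaCy so .similarity still works;
    # fall back to the original sentence text if everything was a stop word.
    filtered_docs.append(nlp(kept if kept else sentence.text))
# filtered_docs[i] lines up with sentences[i], so the original indexes stay usable.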

Thanks so much!

What you need to do is to overwrite the way spaCy computes similarity.

For the similarity computation, spaCy first computes a vector for each doc by averaging the vectors of its tokens (the token.vector attribute), and then performs cosine similarity:

return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
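For reference, here is a minimal sketch of that default behaviour (assuming en_core_web_lg is installed): Doc.vector is just the mean of the token vectors, and similarity is the cosine between two such means.

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("This is a sentence")

# Doc.vector should match the plain average of the token vectors
# (up to floating-point precision), including zero vectors for OOV tokens.
mean_vector = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, mean_vector))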

You have to tweak this slightly so that the vectors of stop words are not taken into account.

The following code should work for you:

import spacy
from spacy.lang.en import STOP_WORDS
import numpy as np
nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # Sum the vectors of all non-stop-word tokens, then average over the doc length.
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    for token in doc1:
        if token.text not in STOP_WORDS:
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if token.text not in STOP_WORDS:
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    # Cosine similarity between the two averaged document vectors.
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))

Hope this helps!

Here is a slightly more elegant solution: we will overwrite how spaCy computes document vectors under the hood, which will propagate this customisation to any downstream pipeline components such as the TextCategorizer, or whatever else.

This is based on the documentation found here: https://spacy.io/usage/processing-pipelines#custom-components-user-hooks

This solution is designed around loading pre-trained embeddings. Rather than referencing a stop-word list directly, I'm just going to assume that anything that is out-of-vocab for the embeddings I loaded is a token I want to ignore in the document vector calculation.

import numpy as np

class FancyDocumentVectors(object):
    def __call__(self, doc):
        doc.user_hooks["vector"] = self.vector
        return doc

    def vector(self, doc):
        """
        Constrain attention to non-zero vectors.
        Returns concatenation of mean and max pooling
        """
        # This is the part where we filter out stop words 
        # (really any token for which we couldn't calculate a vector representation).
        # If you'd rather invoke a stopword list, change the line below to something like:
        # doc_vecs = np.array([t.vector for t in doc if t.text not in STOP_WORDS])
        doc_vecs = np.array([t.vector for t in doc if t.has_vector])
        if sum(doc_vecs.shape) == 0: 
            doc_vecs = np.array([doc[0].vector])

        mean_pooled = doc_vecs.mean(axis=0)
        
        # Because I'm fancy, I'm going to augment my custom document vector with 
        # some additional information. For a demonstration of the value of this 
        # approach, reference the SWEM paper: https://arxiv.org/abs/1805.09843
        max_pooled = doc_vecs.max(axis=0)
        doc_vec = np.hstack([mean_pooled, max_pooled])
        return doc_vec

        # If you're not into it, just return mean_pooled instead.
        # return mean_pooled

nlp.add_pipe(FancyDocumentVectors())

Here is a concrete example using vectors trained on Stack Overflow!

First, we load the pre-trained embeddings into an empty language model.

import spacy
from gensim.models.keyedvectors import KeyedVectors

# https://github.com/vefstathiou/SO_word2vec
word_vect = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)
nlp = spacy.blank('en')
nlp.vocab.vectors = spacy.vocab.Vectors(data=word_vect.syn0, keys=word_vect.index2word) 

Default behaviour, before changing anything:

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 4.353660220883036 -6.901098

Modified behaviour after overwriting the document vector computation:

# MAGIC!
nlp.add_pipe(FancyDocumentVectors())

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 24.601780061609414 109.74769
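
And because Doc.similarity falls back to doc.vector when no similarity hook is set, the custom vector should also flow through to .similarity calls, which is what the original question was after. A quick sketch, assuming the pipeline set up above (the second sentence is just an arbitrary example):

doc1 = nlp("This is a question about spacy.")
doc2 = nlp("Here is another spacy question.")

# Both docs now use the custom (OOV-filtered) vectors under the hood.
print(doc1.similarity(doc2))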