How to run spaCy's sentence similarity function on an array of strings to get an array of scores?

Question

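I need to compare one spaCy document against a list of spaCy documents and get a list of similarity scores as output. I could of course do this with a for loop, but I'm looking for a more optimized solution, something like the broadcasting that NumPy offers.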

I have one document and a list of documents:

oneDoc = 'Hello, I want to be compared with a list of documents'
listDocs = ["I'm the first one", "I'm the second one"]

spaCy provides a document similarity function:

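import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # assumed model; any pipeline with word vectors works
oneDoc = nlp(oneDoc)
listDocs = list(nlp.pipe(listDocs))  # nlp() takes a single string, so use nlp.pipe for a list
similarity_score = np.zeros(len(listDocs))
for i, doc in enumerate(listDocs):
    similarity_score[i] = oneDoc.similarity(doc)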

Since one document is compared against a list of two documents, the similarity scores would look something like this: [0.7, 0.8]

I'm looking for a way to avoid this for loop. In other words, I want to vectorize this function.

Answer 1

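Process all of your text documents with nlp.pipe. Get the embedding from each document via .vector. Then apply a pairwise distance function that works on NumPy arrays, with cosine as the metric, to build the matrix (NumPy itself has no built-in pairwise distance function; scipy.spatial.distance.cdist is one option). Note that a cosine distance is 1 minus the cosine similarity.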
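The steps in this answer could be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the helper names cosine_scores and similarity_scores are hypothetical, and it computes cosine similarity directly with NumPy rather than going through a distance function. It assumes nlp is a loaded spaCy pipeline that includes word vectors and that the list of texts is non-empty.

```python
import numpy as np


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one vector and each row of a matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # One matrix-vector product yields every dot product at once.
    dots = doc_vecs @ query_vec
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    # Rows with a zero norm (e.g. texts with no known words) get a score of 0.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


def similarity_scores(nlp, one_text, texts):
    """Score one_text against every entry of texts (hypothetical helper)."""
    query_vec = nlp(one_text).vector
    # nlp.pipe processes the texts as a stream, in batches.
    doc_vecs = np.array([doc.vector for doc in nlp.pipe(texts)])
    return cosine_scores(query_vec, doc_vecs)
```

With something like nlp = spacy.load("en_core_web_md") (an assumed model choice), similarity_scores(nlp, oneDoc, listDocs) on the raw strings from the question returns an array with one score per document. Since Doc.similarity defaults to the cosine similarity of the document vectors, the values should closely match the loop version. The text processing itself still runs once per document; only the scoring step is vectorized.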


Tags: python, nlp, similarity, vectorization, spacy