How does spaCy generate vectors for phrases?

spaCy's medium and large models can generate vectors for words and phrases. Consider the following example:

import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("apple cat sky")

print(tokens.text, tokens.vector[:3], tokens.vector_norm)  # only the first three components of the vector

for token in tokens:
    print(token.text, token.vector[:3], token.vector_norm)

Output:

apple cat sky [-0.06734333  0.03672066 -0.13952099] 4.845729844425328
apple [-0.36391  0.43771 -0.20447] 7.1346846
cat [-0.15067  -0.024468 -0.23368 ] 6.6808186
sky [ 0.31255  -0.30308   0.019587] 6.617719

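Clearly the vocabulary contains a vector for each word, but how is the vector for the whole phrase generated? As you can see, it is not simply the sum of the word vectors.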

By default, the vector of a Doc is the average of its token vectors; cf. https://spacy.io/usage/vectors-similarity:

Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors.

In your example, the average for the sentence (apple cat sky) can be computed component by component:

  • (-0.36391 (apple) + -0.15067 (cat) + 0.31255 (sky)) / 3 = -0.06734333

  • (0.43771 (apple) + -0.024468 (cat) + -0.30308 (sky)) / 3 = 0.03672066

  • (-0.20447 (apple) + -0.23368 (cat) + 0.019587 (sky)) / 3 = -0.13952099
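
The same arithmetic can be reproduced in plain Python. This is only a minimal sketch: it uses just the first three components printed in the output, and mean_vector is a hypothetical helper name, not part of spaCy.

```python
# First three components of each token vector, copied from the printed output.
token_vectors = {
    "apple": [-0.36391, 0.43771, -0.20447],
    "cat": [-0.15067, -0.024468, -0.23368],
    "sky": [0.31255, -0.30308, 0.019587],
}


def mean_vector(vectors):
    """Return the component-wise average of equally long vectors."""
    vectors = list(vectors)
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(components) / len(vectors) for components in zip(*vectors)]


averaged = mean_vector(token_vectors.values())
print([round(value, 6) for value in averaged])
```

The result agrees with the first three components of the phrase vector up to rounding; spaCy stores vectors as 32-bit floats, so the last printed digit can differ slightly.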

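So the phrase vector is not the sum of the token vectors but their average.

In other words, you obtain the vector of a sentence by averaging the vectors of its tokens.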
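
One way to check this behaviour directly is to compare Doc.vector with a hand-computed mean of the token vectors. The sketch below assumes spaCy and NumPy are installed. To stay self-contained it builds a toy Vocab from the three truncated components printed in the output instead of loading en_core_web_md, so the vectors are 3-dimensional stand-ins rather than the model's real ones.

```python
import numpy as np
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Toy 3-dimensional vectors (assumption: truncated values from the output),
# standing in for the full vectors that en_core_web_md provides.
toy_vectors = {
    "apple": np.array([-0.36391, 0.43771, -0.20447], dtype="float32"),
    "cat": np.array([-0.15067, -0.024468, -0.23368], dtype="float32"),
    "sky": np.array([0.31255, -0.30308, 0.019587], dtype="float32"),
}

vocab = Vocab()
for word, vector in toy_vectors.items():
    vocab.set_vector(word, vector)

# Build the Doc from pre-tokenized words so no trained pipeline is needed.
doc = Doc(vocab, words=["apple", "cat", "sky"])

# Average the token vectors by hand and compare with Doc.vector.
manual_mean = np.mean([token.vector for token in doc], axis=0)
matches = np.allclose(doc.vector, manual_mean)
print(doc.vector, matches)
```

With the real model, the same comparison should hold if tokens = nlp("apple cat sky") is used in place of the hand-built Doc, provided no custom vector hook has been set on the Doc.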