How does spaCy generate vectors for phrases?
spaCy's medium and large models can generate vectors for both words and phrases. Consider the following example:
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("apple cat sky")
# Print only the first three components of each vector
print(tokens.text, tokens.vector[:3], tokens.vector_norm)
for token in tokens:
    print(token.text, token.vector[:3], token.vector_norm)
Output:
apple cat sky [-0.06734333 0.03672066 -0.13952099] 4.845729844425328
apple [-0.36391 0.43771 -0.20447] 7.1346846
cat [-0.15067 -0.024468 -0.23368 ] 6.6808186
sky [ 0.31255 -0.30308 0.019587] 6.617719
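Clearly the vocabulary contains a vector for each word, but how is the vector for the whole phrase generated?
As you can see, it is not simply the sum of the word vectors.
By default, a Doc's vector is the average of its tokens' vectors; cf. https://spacy.io/usage/vectors-similarity: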
Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors.
In your example, the average for the sentence (apple cat sky) can be computed component by component:
(-0.36391 (apple) + (-0.15067) (cat) + 0.31255 (sky)) / 3 = -0.06734333
(0.43771 (apple) + (-0.024468) (cat) + (-0.30308) (sky)) / 3 = 0.03672066
(-0.20447 (apple) + (-0.23368) (cat) + 0.019587 (sky)) / 3 = -0.13952099
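The arithmetic above can be checked without loading a model. The sketch below is only an illustration: it copies the first three components of each token vector from the printed output, and the helper name mean_vector is made up for this example rather than being part of spaCy. It averages the vectors component by component and compares the result with the reported Doc.vector[:3].

```python
# First three components of each token vector, copied from the printed output.
token_vectors = {
    "apple": [-0.36391, 0.43771, -0.20447],
    "cat": [-0.15067, -0.024468, -0.23368],
    "sky": [0.31255, -0.30308, 0.019587],
}


def mean_vector(vectors):
    """Return the component-wise average of equal-length vectors.

    Hypothetical helper for illustration; not a spaCy function.
    """
    vectors = list(vectors)
    return [sum(components) / len(vectors) for components in zip(*vectors)]


doc_vector_head = mean_vector(token_vectors.values())

# Doc.vector[:3] as printed for "apple cat sky".
reported = [-0.06734333, 0.03672066, -0.13952099]

# The printed values are rounded, so compare with a small tolerance.
for computed, expected in zip(doc_vector_head, reported):
    assert abs(computed - expected) < 1e-6

print(doc_vector_head)
```

With NumPy arrays, the same average could be taken with np.mean over axis 0 of the stacked token vectors.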
It is not the sum of the vectors but their average.
You can find the vector of a sentence by taking the average of its token vectors.