Emoji vectors via spacy

Question

In short: where do emoji vectors in spaCy come from, and where is this documented?


import numpy as np
import spacy

nlp = spacy.load('en_core_web_sm')

a = ""
b = "❄️"
v = ""
h = ""
l = ""
e = [a,b,v,h,l]

# emoji vector: one Doc-level vector per string
ev = [nlp(emoji).vector for emoji in e]

# numpy array
ev = np.array(ev)

ev.shape

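The shape is (5, 96), so I'm curious where I can learn more about where these vectors come from. At first I assumed the emoji were OOV (out of vocabulary), but: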

ev.sum(axis=1)

yields

array([2.906692 , 3.8687153, 1.2295313, 3.986846 , 1.9255924],
      dtype=float32)

All of the above was run in a Colab environment as of 2/21/2021.

Answer 1

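The sm models do not include word vectors. When there are no word vectors, token.vector returns token.tensor as a backoff, which is a context-sensitive tensor from the tagger component. See the first warning box here: https://v2.spacy.io/usage/vectors-similarity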

If you want word vectors, use the md or lg models instead. The emoji will then be OOV, and token.vector will return an all-zero 300-d vector.
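
One way to check which case you are in is to look at the per-token attributes rather than just the vector values. The following is only a rough sketch: vector_report and load_and_report are hypothetical helper names (not part of spaCy), and load_and_report assumes the named model package is already installed. It relies only on Token.text, Token.has_vector, Token.is_oov, Token.vector and Token.vector_norm, plus nlp.vocab.vectors.shape.

```python
def vector_report(doc):
    """Summarize, per token, what backs token.vector.

    `doc` can be a spaCy Doc (or any iterable of token-like objects).
    """
    return [
        {
            "text": token.text,
            "has_vector": bool(token.has_vector),  # True only if a word vector exists
            "is_oov": bool(token.is_oov),          # out-of-vocabulary flag
            "dim": len(token.vector),              # width of whatever token.vector returned
            "norm": float(token.vector_norm),      # 0.0 for an all-zero vector
        }
        for token in doc
    ]


def load_and_report(model_name, text="❄️"):
    """Hypothetical convenience wrapper; assumes `model_name` is installed."""
    import spacy  # imported lazily so vector_report has no hard dependency

    nlp = spacy.load(model_name)
    # Shape of the word-vector table shipped with the model.
    return nlp.vocab.vectors.shape, vector_report(nlp(text))
```

Calling load_and_report("en_core_web_sm") and load_and_report("en_core_web_md") side by side should make the difference visible: if has_vector is False but dim and norm are non-zero, the values are coming from the tensor backoff rather than from a word-vector table.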

emoji

spacy