Word2Vec returning vectors for individual characters and not words

For the following list:

words= ['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA','unimodal','7','regarding','random','59','intimating','COMPETITION','prospects','2K15','gather','Mega','SENSOR','NCTT','NETWORKING','orgainsed','acts']

I tried:

from gensim.models import Word2Vec
vec_model= Word2Vec(words, min_count=1, size=30)
vec_model['gather']

which returns:

KeyError: "word 'gather' not in vocabulary"

But

vec_model['g']

does return a vector, so I believe it is returning vectors for all the characters found in the list instead of vectors for all the words found in the list.

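Word2Vec expects a list of lists as input: the corpus (the outer list) consists of individual documents, and each document is a list of words (tokens). Word2Vec iterates over all documents and over every token in each one. In your example you passed a single flat list, so Word2Vec treated each word as a separate document and iterated over its characters, interpreting each character as a token. As a result, you built a vocabulary of characters rather than words. To build a vocabulary of words, pass a nested list to Word2Vec, as in the following example.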

from gensim.models import Word2Vec

words= [['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA'],
['unimodal','7','regarding','random','59','intimating'],
['COMPETITION','prospects','2K15','gather','Mega'],
['SENSOR','NCTT','NETWORKING','orgainsed','acts']]

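# Note: in gensim 4.0 and later, `size` is named `vector_size`,
# and vectors are read through the `wv` attribute: vec_model.wv['gather']
vec_model= Word2Vec(words, min_count=1, size=30)
vec_model['gather']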

Output:

array([ 0.01106581,  0.00968017, -0.00090574,  0.01115612, -0.00766465,
       -0.01648632, -0.01455364,  0.01107104,  0.00769841,  0.01037362,
        0.01551551, -0.01188449,  0.01262331,  0.01608987,  0.01484082,
        0.00528397,  0.01613582,  0.00437328,  0.00372362,  0.00480989,
       -0.00299072, -0.00261444,  0.00282137, -0.01168992, -0.01402746,
       -0.01165612,  0.00088562,  0.01581018, -0.00671618, -0.00698833],
      dtype=float32)
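
To see why a flat list produces a character vocabulary, here is a minimal sketch in plain Python that does not use gensim at all. The helper `build_vocab` is hypothetical and only mimics the two nested loops described above (outer loop over documents, inner loop over tokens); it is not gensim's actual implementation.

```python
from collections import Counter

def build_vocab(corpus):
    """Count tokens with an outer loop over documents and an inner loop over tokens.

    Hypothetical stand-in for a vocabulary-building pass, for illustration only.
    """
    vocab = Counter()
    for document in corpus:
        for token in document:  # iterating over a str yields single characters
            vocab[token] += 1
    return vocab

flat = ['gather', 'Mega', 'acts']        # flat list: each string acts as a "document"
nested = [['gather', 'Mega', 'acts']]    # nested list: one document with three tokens

flat_vocab = build_vocab(flat)
nested_vocab = build_vocab(nested)

print('gather' in flat_vocab, 'g' in flat_vocab)      # False True
print('gather' in nested_vocab, 'g' in nested_vocab)  # True False
```

If all the words really belong to one document, wrapping the original flat list as `[words]` gives the nested shape with a single document.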