Word2Vec returning vectors for individual characters and not words
For the following list:
words= ['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA','unimodal','7','regarding','random','59','intimating','COMPETITION','prospects','2K15','gather','Mega','SENSOR','NCTT','NETWORKING','orgainsed','acts']
I tried:
from gensim.models import Word2Vec
vec_model= Word2Vec(words, min_count=1, size=30)
vec_model['gather']
which returns:
KeyError: "word 'gather' not in vocabulary"
However,
vec_model['g']
returns a vector, so I believe it is returning vectors for all the characters found in the list, instead of vectors for all the words found in the list.
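Word2Vec needs a list of lists as input: the corpus (the outer list) is made up of individual documents, and each document is made up of individual words (tokens). Word2Vec iterates over every document and, within each document, over every token. In your example you passed a single flat list, so Word2Vec treats each word as a separate document and iterates over that word's characters, treating each character as a token. As a result, you built a vocabulary of characters, not words. To build a vocabulary of words, pass a nested list to Word2Vec, as in the example below.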
from gensim.models import Word2Vec
words= [['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA'],
['unimodal','7','regarding','random','59','intimating'],
['COMPETITION','prospects','2K15','gather','Mega'],
['SENSOR','NCTT','NETWORKING','orgainsed','acts']]
vec_model= Word2Vec(words, min_count=1, size=30)
vec_model['gather']
Output:
array([ 0.01106581, 0.00968017, -0.00090574, 0.01115612, -0.00766465,
-0.01648632, -0.01455364, 0.01107104, 0.00769841, 0.01037362,
0.01551551, -0.01188449, 0.01262331, 0.01608987, 0.01484082,
0.00528397, 0.01613582, 0.00437328, 0.00372362, 0.00480989,
-0.00299072, -0.00261444, 0.00282137, -0.01168992, -0.01402746,
-0.01165612, 0.00088562, 0.01581018, -0.00671618, -0.00698833],
dtype=float32)
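The iteration behaviour described above can be reproduced without Gensim. The following is a minimal sketch, not Gensim's actual implementation: `build_vocab` is a hypothetical helper that only mimics the "outer loop over documents, inner loop over tokens" pattern, and the three-word lists are a shortened stand-in for the original data.

```python
from collections import Counter


def build_vocab(corpus):
    """Count tokens with an outer loop over documents and an inner loop over tokens.

    Hypothetical helper for illustration only; it is not part of Gensim.
    """
    counts = Counter()
    for document in corpus:
        # If `document` is a string, iterating over it yields single characters.
        for token in document:
            counts[token] += 1
    return counts


flat = ['gather', 'Mega', 'acts']        # each string is treated as a "document"
nested = [['gather', 'Mega', 'acts']]    # one document containing three tokens

flat_vocab = build_vocab(flat)
nested_vocab = build_vocab(nested)

print('gather' in flat_vocab, 'g' in flat_vocab)      # False True
print('gather' in nested_vocab, 'g' in nested_vocab)  # True False
```

Note that the snippets in this thread use the older Gensim API. In Gensim 4.0 and later, the `size` parameter is named `vector_size`, and vectors are looked up through `vec_model.wv['gather']` rather than `vec_model['gather']`.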