KeyError: "word 'restrictions' not in vocabulary" while generating word embedding vectors for text, read from a textfile

Question

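I am getting this error: "KeyError: word 'restriction' not in vocabulary" when I read a text file to generate word embedding vectors, even though the word 'restrictions' is in the text file. Is there something wrong with my code for reading the text file (a simple paragraph)?

My code is below: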

from gensim.models import Word2Vec
# define training data
with open(r'D:\test.txt', 'r') as file:
    sentences = ""
    # read from textfile
    for line in file:
        for word in line.split(' '):
            sentences += word + ' '
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print(str(model['restriction']))

This error does not occur when I use sentences written directly in the code:

from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],  
                ['this', 'is', 'the', 'second', 'sentence'],  
                ['yet', 'another', 'sentence'],  
                ['one', 'more', 'sentence', 'with', 'restriction'],
                ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
print(words)
# access vector for one word
print(model['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print('the model prints: ')
print(model['restriction'])

Answer 1

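In the code that shows the problem, double-check sentences after constructing it, to see whether it is in the format you expect (or anything like the sentences in your working case). I suspect it is not.

Also, look at the list of words the disappointing model has learned: the words variable you created should suffice. It may not look like what you expect either.

Specifically, this part of your code...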
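One way to do that check is sketched below. It is a minimal sketch, not part of gensim: looks_like_token_lists is a hypothetical helper name, and the two sample values merely stand in for the broken and working inputs from the question.

```python
def looks_like_token_lists(sentences):
    """Return True if `sentences` is a non-string iterable of lists of strings."""
    if isinstance(sentences, str):
        # A bare string is the broken case: one long run of characters.
        return False
    return all(
        isinstance(sentence, (list, tuple))
        and all(isinstance(token, str) for token in sentence)
        for sentence in sentences
    )

# Stand-ins for the two inputs in the question (shortened for illustration).
broken = 'this is the first sentence '
working = [['this', 'is', 'the', 'first', 'sentence'],
           ['yet', 'another', 'sentence']]

print(type(broken).__name__, looks_like_token_lists(broken))
print(type(working).__name__, looks_like_token_lists(working))
```

Printing type(sentences) and the first item or two right before training is usually enough to spot the difference.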

sentences = ""
for line in file:
    for word in line.split(' '):
        sentences += word + ' '

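...makes sentences one long string containing many space-separated words. If you did this to the sentences in your working code, you would no longer have a list in which each item is a list of tokens. (That is a good input format for Word2Vec.) Instead, you would have one giant run-on string: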

sentences = 'this is the first sentence for word2vec this is the second sentence yet another sentence one more sentence with restriction and the final sentence'
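To see why the lookup then fails, here is a minimal sketch in plain Python. It assumes the model walks its input as an iterable of sentences, each of which is an iterable of tokens; naive_vocab is a hypothetical stand-in for that traversal, not a gensim function.

```python
def naive_vocab(sentences):
    """Collect tokens by iterating over sentences, then over tokens in each."""
    vocab = set()
    for sentence in sentences:
        for token in sentence:
            vocab.add(token)
    return vocab

# Broken case: one run-on string. Iterating a string yields single characters.
run_on = 'one more sentence with restriction'
# Working case: a list of token lists.
tokenized = [['one', 'more', 'sentence', 'with', 'restriction']]

string_vocab = naive_vocab(run_on)
list_vocab = naive_vocab(tokenized)

print(sorted(string_vocab))
print('restriction' in string_vocab)  # False
print('restriction' in list_vocab)    # True
```

With the string input, every "word" the traversal sees is a single character, so a whole word such as 'restriction' never enters the vocabulary.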

Try this instead:

sentences = []  # empty list
# OOPS, DON'T DO: sentences = ""
for line in file:
    # split() with no argument splits on any whitespace and drops the trailing newline
    sentences.append(line.split())

...then your sentences will be a list of lists of strings (as in the working case), rather than just one string (as in the broken case).
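Putting that together, a fuller reading routine might look like the sketch below. The function name read_sentences, the UTF-8 encoding, and the temporary sample file are assumptions for illustration, not details from the question. It also shows a second pitfall: line.split(' '), as used in the question's loop, leaves the newline attached to the last word of each line, so that token would not match a lookup for the bare word.

```python
import os
import tempfile

def read_sentences(path):
    """Return a list of token lists, one per non-empty line of the file."""
    sentences = []
    with open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            tokens = line.split()  # any whitespace; trailing '\n' is dropped
            if tokens:             # skip blank lines
                sentences.append(tokens)
    return sentences

# Hypothetical sample file standing in for the asker's text file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('one more sentence with restriction\n\nand the final sentence\n')
    sample_path = tmp.name

sentences = read_sentences(sample_path)
os.remove(sample_path)
print(sentences)

# For comparison: split(' ') keeps the newline glued to the last word.
print('with restriction\n'.split(' '))
```

The resulting list can then be passed to Word2Vec(sentences, min_count=1) exactly as in the working example. Note too that a lookup only succeeds for the exact token seen in training, so 'restriction' and 'restrictions' are separate vocabulary entries.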

python

text-files

word2vec

deep-learning