KeyError: "word 'restrictions' not in vocabulary" while generating word embedding vectors for text, read from a textfile

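I get the error "KeyError: word 'restriction' not in vocabulary" when I read a text file to generate word embedding vectors, even though the word 'restrictions' is in the text file. Is there something wrong with the code I use to read the text file (a simple paragraph)?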

My code is below:

from gensim.models import Word2Vec
# define training data
with open(r'D:\test.txt', 'r') as file:
    sentences = ""
    # read from textfile
    for line in file:
        for word in line.split(' '):
            sentences += word + ' '
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print(str(model['restriction']))

The error does not occur when I use hard-coded sentences in the code:

from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],  
                ['this', 'is', 'the', 'second', 'sentence'],  
                ['yet', 'another', 'sentence'],  
                ['one', 'more', 'sentence', 'with', 'restriction'],
                ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
print(words)
# access vector for one word
print(model['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print('the model prints: ')
print(model['restriction'])

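In the code that shows the problem, take a close look at sentences after you build it and check whether it has the format you expect (that is, the same format as sentences in the working case). I suspect it does not.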

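Also, look at the list of words the disappointing model learned; the words variable you create should be enough for that. It probably won't look the way you expect either.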
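One way to run the check described above is a small shape test on sentences before training. This is only a sketch: the helper name is_list_of_token_lists and the two sample values are made up for illustration and are not part of gensim.

```python
def is_list_of_token_lists(sentences):
    """Return True if `sentences` is a collection of lists (or tuples) of strings."""
    # A plain string is iterable too, so it has to be rejected explicitly.
    if isinstance(sentences, str):
        return False
    return all(
        isinstance(sentence, (list, tuple))
        and all(isinstance(token, str) for token in sentence)
        for sentence in sentences
    )


# Hypothetical samples mirroring the broken and the working case
broken = "this is the first sentence "
working = [["this", "is", "the", "first", "sentence"]]

print(type(broken).__name__, is_list_of_token_lists(broken))
print(type(working).__name__, is_list_of_token_lists(working))
```

Printing type(sentences) and the first item or two of sentences right before the Word2Vec(...) call gives the same information with less ceremony.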

Specifically, this part of your code...

sentences = ""
for line in file:
    for word in line.split(' '):
        sentences += word + ' '

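...makes sentences one long string of space-separated words. Word2Vec works well with a list in which each item is a list of tokens, as in your working code. If you applied the same loop to the sentences in that working code, you would lose that structure and end up with one giant run-on string: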

sentences = 'this is the first sentence for word2vec this is the second sentence yet another sentence one more sentence with restriction and the final sentence'
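To see why a run-on string leads to a missing word, the sketch below walks a corpus in the nested order corpus, then sentence, then token, which is the shape of input Word2Vec expects. The function build_vocabulary is a hypothetical stand-in written for this illustration, not gensim code, and it only collects tokens; it does no training.

```python
def build_vocabulary(corpus):
    """Collect every token reached by iterating corpus -> sentence -> token."""
    vocabulary = set()
    for sentence in corpus:
        for token in sentence:
            vocabulary.add(token)
    return vocabulary


run_on = "one more sentence with restriction"
token_lists = [["one", "more", "sentence", "with", "restriction"]]

string_vocab = build_vocabulary(run_on)      # iterating a string yields characters
list_vocab = build_vocabulary(token_lists)   # iterating token lists yields words

print(sorted(string_vocab))
print("restriction" in string_vocab)
print("restriction" in list_vocab)
```

With the string, every "sentence" is a single character, so the collected vocabulary holds letters and the space character but no whole words.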

Try this instead:

sentences = []  # empty list
# OOPS, DON'T DO: sentences = ""
for line in file:
    sentences.append(line.split())  # split on any whitespace; also drops the trailing newline

...then your sentences will be a list of lists of strings (as in the working case), rather than a single string (as in the broken case).
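Putting the pieces together, a minimal end-to-end sketch might look like the following. Several things here are assumptions rather than part of the original question or answer: the function names read_sentences and train_and_lookup, the UTF-8 encoding, the stripping of surrounding punctuation (so that "restrictions." and "restrictions" become the same token), and the sample file content. The lookup goes through model.wv, because recent gensim releases no longer support indexing the model object directly.

```python
import string
import tempfile
from pathlib import Path


def read_sentences(path):
    """Return one list of tokens per non-empty line of a text file."""
    sentences = []
    with open(path, "r", encoding="utf-8") as handle:  # encoding is an assumption
        for line in handle:
            # split() with no argument splits on any whitespace and drops the newline
            tokens = [word.strip(string.punctuation) for word in line.split()]
            tokens = [token for token in tokens if token]  # drop punctuation-only tokens
            if tokens:
                sentences.append(tokens)
    return sentences


def train_and_lookup(sentences, word):
    """Train a small Word2Vec model and return the vector for `word`, or None."""
    # Imported here so that read_sentences can be used without gensim installed
    from gensim.models import Word2Vec

    model = Word2Vec(sentences, min_count=1)
    # The membership test avoids the KeyError for words that were never seen
    return model.wv[word] if word in model.wv else None


# Demonstration on a throwaway file (hypothetical content, not the asker's test.txt)
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_text(
        "one more sentence with restrictions.\nand the final sentence\n",
        encoding="utf-8",
    )
    demo_sentences = read_sentences(sample)

print(demo_sentences)
```

Calling train_and_lookup(demo_sentences, 'restrictions') should then return a vector, while 'restriction' would still be missing: the question's file contains 'restrictions', and the two spellings are separate vocabulary keys.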