KeyError: "word 'restrictions' not in vocabulary" while generating word embedding vectors for text, read from a textfile
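I get this error: "KeyError: word 'restriction' not in vocabulary" when I read a text file to generate word embedding vectors, even though the word 'restrictions' is in the text file. Is there a mistake in the code I use to read the text file (a simple paragraph)?
My code is below: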
from gensim.models import Word2Vec
# define training data
with open('D:\test.txt', 'r') as file:
    sentences = ""
    #read from textfile
    for line in file:
        for word in line.split(' '):
            sentences += word + ' '
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print(str(model['restriction']))
When I use sentences written directly in the code, the error does not occur:
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence', 'with', 'restriction'],
             ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
print(words)
# access vector for one word
print(model['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
print('the model prints: ')
print(model['restriction'])
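In the code that shows the problem, double-check sentences after it has been built, to see whether it is in the format you expect (or in any format like the sentences of the working case). I suspect it is not.
Also, review the list of learned words in the disappointing model; the words variable you created should be sufficient. It probably does not look like what you expect either.
Specifically, this part of your code...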
sentences = ""
for line in file:
    for word in line.split(' '):
        sentences += word + ' '
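...makes sentences one long string of many space-separated words. If you did the same to sentences in your working code, you would no longer have a list in which each item is a list of tokens. (That is a good input format for Word2Vec.) Instead, you would have one giant run-on string: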
sentences = 'this is the first sentence for word2vec this is the second sentence yet another sentence one more sentence with restriction and the final sentence'
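A rough illustration of why the run-on string breaks the vocabulary, assuming the model consumes its input as an iterable of sentences, each of which is an iterable of tokens. The helper simulated_vocab is hypothetical and only mimics that iteration with a Counter; it is not part of gensim:

```python
from collections import Counter


def simulated_vocab(corpus):
    """Count tokens the way an iterable-of-sentences consumer would see them.

    Hypothetical stand-in for vocabulary building; not a gensim function.
    """
    counts = Counter()
    for sentence in corpus:      # each item is treated as one sentence
        for token in sentence:   # each item inside it is treated as one token
            counts[token] += 1
    return counts


as_string = "one more sentence with restriction "
as_token_lists = [["one", "more", "sentence", "with", "restriction"]]

string_vocab = simulated_vocab(as_string)
list_vocab = simulated_vocab(as_token_lists)

# With a plain string, every "sentence" is one character, so every
# "token" is one character as well.
print(sorted(string_vocab))
print("restriction" in string_vocab)   # False

# With a list of token lists, whole words survive as tokens.
print(sorted(list_vocab))
print("restriction" in list_vocab)     # True
```

If the words list printed from the failing model contains only single characters, this is the likely explanation.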
Try this instead:
sentences = []  # empty list
# OOPS, DON'T DO: sentences = ""
for line in file:
    sentences.append(line.split(' '))
...then sentences will be a list of lists of strings (as in the working case), rather than a single string (as in the broken case).
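For completeness, here is a minimal sketch of a file reader that produces that list-of-token-lists shape. The function name read_sentences and the demo file contents are assumptions made for illustration. It uses line.split() with no argument, because line.split(' ') leaves the trailing newline attached to the last word of each line (for example 'restriction\n'), which would then fail an exact lookup:

```python
import os
import tempfile


def read_sentences(path):
    """Return one list of tokens per non-empty line of the file (hypothetical helper)."""
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = line.split()   # splits on any whitespace, drops '\n'
            if tokens:              # skip blank lines
                sentences.append(tokens)
    return sentences


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("one more sentence with restriction\n\nand the final sentence\n")
    demo_path = tmp.name

sentences = read_sentences(demo_path)
os.remove(demo_path)
print(sentences)

# What split(' ') would have produced for the last word of a line:
print("with restriction\n".split(" "))   # ['with', 'restriction\n']
```

The result can be passed as Word2Vec(sentences, min_count=1). Two further things are worth checking. Tokens are matched exactly, so a file that contains only 'restrictions' will still raise KeyError for 'restriction'. And in newer gensim releases (4.x), vectors are looked up through model.wv['restriction'] and the vocabulary through model.wv.key_to_index rather than model[...] and model.wv.vocab. On Windows, writing the path as a raw string such as r'D:\test.txt' avoids \t being read as a tab character.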