word2vec/gensim — RuntimeError: you must first build vocabulary before training the model
I ran into a problem training my own word2vec model on .txt files.
Code:
import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
# loading the .txt files
sentences = []
sentence = []

for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":
                        sentence.append(form.lower())

# trying to train the model
from gensim.models import Word2Vec

model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)
Error message:
RuntimeError: you must first build vocabulary before training the model
How do I build the vocabulary? The code works on a sample .conll file, but I want to train the model on my own data.
Your sentences list is probably empty. The only line of code that appends anything requires line to be an empty string and sentence to be non-empty. Perhaps that never happens.
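One likely concrete cause, assuming the .txt files are plain prose rather than tab-separated CoNLL rows: line.split("\t") then yields a single column, so len(cols) > 4 is never true and nothing is ever appended. Even for well-formed files there is a smaller gap: a file that does not end with a blank line drops its final sentence, because the append only fires on empty lines. A minimal sketch of a flush for that case, reusing the loop from the question:

for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            ...  # same per-line parsing as in the question
    # flush the sentence still being built when a file
    # ends without a trailing blank line
    if len(sentence) > 0:
        sentences.append(sentence)
        sentence = []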
Check the value of sentences before creating the model. Make sure it has the expected length and number of texts, and look at the first few (say, sentences[0:2]) to make sure they look right. Each item in sentences should itself be a list of strings. If not, debug the code that reads the files and assembles the sentences sequence until it looks as expected.
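For example, a quick sanity check on the sentences list from the question might look like this:

print(len(sentences))   # total sentence count; 0 means nothing was parsed
print(sentences[0:2])   # eyeball the first couple of sentences
# each item should itself be a list of strings
assert all(isinstance(s, list) for s in sentences[0:2])
assert all(isinstance(w, str) for s in sentences[0:2] for w in s)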
If you still have problems, then when you edit this question, or in any follow-up questions, be sure to:
- show the entire error message you receive, including all lines of the traceback that show filenames, lines of code, and line numbers
- describe your corpus files in more detail, for example with a sample of some of their contents
Thanks to @gojomo's suggestion and this answer, I solved the empty sentences problem. I needed the following code block:
# an iterator that reads the file one line at a time,
# instead of loading everything into memory at once;
# each line is one sentence
class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        for line in open(self.filepath):
            yield line.split()
Before training the model:
# training the model
sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')

# min_count prunes the internal dictionary: words that appear only once in
# the corpus are probably uninteresting typos and garbage, and there is not
# enough data to do any meaningful training on them, so it's best to ignore them
model = gensim.models.Word2Vec(sentences, min_count=2)
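As an aside, gensim already ships a streaming corpus reader, LineSentence, that does the same one-sentence-per-line, whitespace-split reading, so the custom iterator class is optional:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence yields one whitespace-tokenized sentence per line of the file,
# equivalent to the SentenceIterator above
sentences = LineSentence('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')
model = Word2Vec(sentences, min_count=2)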