word2vec/gensim — RuntimeError: you must first build vocabulary before training the model

I'm running into a problem when training my own word2vec model on .txt files.


import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors

# loading the .txt files

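sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                # a blank line marks the end of a sentence
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":
                        # body missing in the post; appending the form is assumed
                        sentence.append(form)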

# trying to train the model

from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)


RuntimeError: you must first build vocabulary before training the model


The code works on the sample .conll files, but I want to train the model on my own data.

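Your sentences list is probably empty. The only line of code that adds anything to it requires line to be an empty string and sentence to be non-empty. Perhaps that never happens with your files.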
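To illustrate how that can happen, here is a minimal sketch of the same blank-line and tab-column logic applied to in-memory lines. The function name `read_conll_sentences` and the sample lines are hypothetical, and the sketch assumes your .txt files contain plain prose rather than tab-separated CoNLL rows.

```python
def read_conll_sentences(lines):
    """Collect word forms from CoNLL-style rows; a blank line ends a sentence."""
    sentences, sentence = [], []
    for line in lines:
        line = line.rstrip()
        if line == "":
            if sentence:
                sentences.append(sentence)
                sentence = []
            continue
        cols = line.split("\t")
        # Rows need more than 4 tab-separated columns; punctuation is skipped.
        if len(cols) > 4 and cols[3] != "PONCT":
            sentence.append(cols[1])
    if sentence:  # keep a final sentence that has no trailing blank line
        sentences.append(sentence)
    return sentences


# Hypothetical sample data, for illustration only.
conll_lines = [
    "1\tLe\tle\tDET\t_",
    "2\tchat\tchat\tNC\t_",
    "3\tdort\tdormir\tV\t_",
    "4\t.\t.\tPONCT\t_",
    "",
]
plain_lines = ["Le chat dort.", "", "Il pleut."]

conll_result = read_conll_sentences(conll_lines)
plain_result = read_conll_sentences(plain_lines)
print(conll_result)  # [['Le', 'chat', 'dort']]
print(plain_result)  # []
```

Plain-text lines contain no tabs, so `len(cols) > 4` is never true, nothing is ever appended, and the result is an empty list. That would be consistent with the error you are seeing.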

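Check the value of sentences before creating the model. Make sure it has the expected length (the number of texts), and look at the first few items (say, sentences[0:2]) to make sure they look right. Each item in sentences should itself be a list of strings.

If not, debug the code that reads the files and assembles the sentences sequence until it looks as expected.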
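One way to script that check is sketched below. The helper name `summarize_corpus` is hypothetical (it is not part of gensim), and it assumes the corpus is small enough to materialize as a list for inspection.

```python
def summarize_corpus(sentences, preview=2):
    """Return (count, first few items); fail fast if the corpus is unusable."""
    sentences = list(sentences)  # materializes the corpus: inspection only
    if not sentences:
        raise ValueError("sentences is empty: nothing to build a vocabulary from")
    for item in sentences[:preview]:
        # A bare string would be treated as a sequence of single characters.
        if isinstance(item, str) or not all(isinstance(tok, str) for tok in item):
            raise TypeError("each sentence must be a list of string tokens")
    return len(sentences), sentences[:preview]


# Hypothetical sample data, for illustration only.
count, head = summarize_corpus([["le", "chat", "dort"], ["il", "pleut"]])
print(count)
print(head)
```

Calling it on an empty list raises a ValueError before Word2Vec is ever constructed, which points at the file-reading code rather than at the training call.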


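If you need more help, it would be useful to expand the question:

  • Show the entire error message you received, including all lines of the traceback with the filenames, lines of code, and line numbers
  • Describe your corpus files in more detail, for example with a sample of their contents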

Thanks to @gojomo's suggestion and this answer, I solved the empty sentences problem. I needed the following code block:

# make an iterator that reads your file one line at a time instead of reading everything in memory at once
# reads all the sentences

class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # reopen the file on every pass and close it when the pass ends
        with open(self.filepath) as f:
            for line in f:
                yield line.split()


# training the model

sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt') 
model = gensim.models.Word2Vec(sentences, min_count=2) # min_count is for pruning 
                                                       # the internal dictionary. 
                                                       # Words that appear only once 
                                                       # in the corpus are probably 
                                                       # uninteresting typos and garbage. 
                                                       # In addition, there’s not enough 
                                                       # data to make any meaningful 
                                                       # training on those words, so it’s
                                                       # best to ignore them
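
Word2Vec reads the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), so the corpus needs to be an iterable that can be restarted, not a one-shot generator. Since the original question read several files with glob, here is a minimal sketch of the same idea over a file pattern. The class name `MultiFileSentences`, the UTF-8 encoding, and the demo file contents are assumptions for illustration.

```python
import glob
import os
import tempfile


class MultiFileSentences:
    """Restartable iterable: yields one whitespace-tokenized line at a time."""

    def __init__(self, pattern, encoding="utf-8"):
        self.pattern = pattern
        self.encoding = encoding

    def __iter__(self):
        # Each call starts again from the first matching file.
        for path in sorted(glob.glob(self.pattern)):
            with open(path, encoding=self.encoding) as f:
                for line in f:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens


# Demo on throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("le chat dort\n\n")
    with open(os.path.join(tmp, "b.txt"), "w", encoding="utf-8") as f:
        f.write("il pleut\n")
    corpus = MultiFileSentences(os.path.join(tmp, "*.txt"))
    first_pass = list(corpus)
    second_pass = list(corpus)  # a second pass starts over from the first file

print(first_pass)
print(first_pass == second_pass)
```

An instance such as MultiFileSentences('./data/*.txt') could then be passed to gensim.models.Word2Vec in place of sentences, in the same way as SentenceIterator above.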