word2vec/gensim — RuntimeError: you must first build vocabulary before training the model

I am running into a problem while training my own word2vec model on .txt files.

Code:

import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors


# loading the .txt files

sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                # a blank line marks the end of a sentence
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":  # skip punctuation tokens
                        sentence.append(form.lower())


# trying to train the model

from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)

Error message:

RuntimeError: you must first build vocabulary before training the model

How do I build the vocabulary?

The code works on sample .conll files, but I want to train the model on my own data.

Your sentences list is probably empty. The only line of code that adds anything to it requires line to be an empty string and sentence to be non-empty. Perhaps that never happens.

Check the value of sentences before creating the model. Make sure it has the expected length and number of texts, and look at the first few (say, sentences[0:2]) to confirm they look right. Each item in sentences should itself be a list of string tokens.

If it doesn't, debug the code that reads the files and assembles the sentences sequence until it looks as expected.
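For example, a quick sanity check along these lines (the expected token values in the comments are only illustrations) will catch an empty or mis-shaped corpus before you build the model:

# quick sanity check before creating the model
print(len(sentences))    # should be the number of sentences read, not 0
print(sentences[0:2])    # each item should be a list of token strings,
                         # e.g. [['je', 'suis', ...], ['il', 'était', ...]]
assert len(sentences) > 0, "corpus is empty; check the file-reading loop"
assert all(isinstance(s, list) for s in sentences[:2])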

If you still have problems, when editing this question or asking follow-ups, be sure to:

  • show the entire error message you receive, including all lines of the 'traceback' with its file names, lines of code, and line numbers
  • describe your corpus files in more detail, for example with an excerpt of their contents
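Separately, on the literal question of building the vocabulary: once sentences really does contain lists of tokens, you can either pass it to the Word2Vec constructor (as in your code) or build the vocabulary and train in explicit steps. A minimal sketch of the two-step pattern, assuming the same sentences variable and your original parameters:

from gensim.models import Word2Vec

model_hugo = Word2Vec(vector_size=200, window=5, sg=1, workers=4)  # no corpus yet
model_hugo.build_vocab(sentences)            # scan the corpus and build the vocabulary
model_hugo.train(sentences,
                 total_examples=model_hugo.corpus_count,
                 epochs=10)                  # then run the training passes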

Thanks to @gojomo's suggestions and this answer, I solved the empty sentences problem. I needed the following code block:

# make an iterator that reads your file one line at a time instead of reading everything in memory at once
# reads all the sentences

class SentenceIterator: 
    def __init__(self, filepath): 
        self.filepath = filepath 

    def __iter__(self): 
        for line in open(self.filepath): 
            yield line.split() 

Then, to train the model:

# training the model

sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')

# min_count=2 prunes the internal dictionary: words that appear only once in the
# corpus are probably uninteresting typos and garbage, and there is not enough
# data to do any meaningful training on them, so it's best to ignore them.
model = gensim.models.Word2Vec(sentences, min_count=2)
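Once training finishes, a quick way to confirm the vocabulary was actually built is to inspect the learned word vectors; the query word below is only an illustration, use any word you know occurs in your corpus:

# check the trained model
print(len(model.wv))                            # size of the learned vocabulary
print(model.wv.most_similar('homme', topn=5))   # nearest neighbours of an example word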