How to obtain the parameter 'total_words' for model.train() of gensim's doc2vec

Question

As you may know, when building a doc2vec model, you typically call model.build_vocab(corpus_file='...') first, and then model.train(corpus_file='...', total_examples=..., total_words=..., epochs=10).

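I am building a model from a huge Wikipedia data file, so I have to specify both 'total_examples' and 'total_words' as parameters of train(). Gensim's tutorial says I can get the first one with total_examples=model.corpus_count. That works. But I don't know how to get the second one, total_words. I can see the total word count in the last log line from model.build_vocab(), as shown below. So for now I put the number in directly, like total_words=1304592715, but I would like to specify it the same way as model.corpus_count. Can someone tell me how to obtain this number? Thank you.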

:
2022-01-29 15:03:22,377 : INFO : PROGRESS: at example #1290000, processed 1253078267 words (6147969/s), 7881288 word types, 0 tags
2022-01-29 15:03:26,434 : INFO : PROGRESS: at example #1300000, processed 1277357579 words (5984975/s), 7959581 word types, 0 tags
2022-01-29 15:03:30,955 : INFO : collected 8039609 word types and 1309452 unique tags from a corpus of 1309452 examples and 1304592715 words
:

Answer 1

Similar to model.corpus_count, the total number of words in the last corpus provided to .build_vocab() should be cached in the model as model.corpus_total_words.
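As a minimal sketch of how that could look in practice: the helper name train_counts, the function build_and_train, and the Doc2Vec settings below are illustrative assumptions, not part of the question. Only corpus_count and corpus_total_words come from the model itself, and both are expected to be populated by build_vocab().

```python
def train_counts(model):
    """Return the counts cached by the last build_vocab() call as train() kwargs.

    `train_counts` is a hypothetical helper name used for illustration only.
    """
    return {
        "total_examples": model.corpus_count,
        "total_words": model.corpus_total_words,
    }


def build_and_train(corpus_path, epochs=10):
    """Build the vocabulary from a corpus file, then train with the cached counts."""
    # Imported here so train_counts() can be used without gensim installed.
    from gensim.models.doc2vec import Doc2Vec

    # vector_size, min_count and workers are placeholder settings (assumptions).
    model = Doc2Vec(vector_size=100, min_count=5, workers=4)
    model.build_vocab(corpus_file=corpus_path)
    model.train(corpus_file=corpus_path, epochs=epochs, **train_counts(model))
    return model
```

With the corpus from the log above, train_counts(model) would be expected to yield total_examples=1309452 and total_words=1304592715, so the number no longer has to be copied from the log by hand.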

python

gensim

doc2vec