Access dictionary in Python gensim topic model

Question

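I would like to see how to access the dictionary from a gensim LDA topic model. This is particularly important when you train an LDA model, save it, and load it later on. In other words, suppose lda_model is the model trained on a collection of documents. To get the document-topic matrix, one can do something like the code below, or something like what is explained in https://www.kdnuggets.com/2019/09/overview-topics-extraction-python-latent-dirichlet-allocation.html: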

import re

from gensim.corpora.dictionary import Dictionary

WORD = re.compile(r'\w+')

def regTokenize(text):
    # tokenize the text into words
    return WORD.findall(text)

# text is the list of new documents (raw strings)
ttext = [regTokenize(d) for d in text]
# this builds a NEW dictionary from the new data, not the one used in training
dic = Dictionary(ttext)
ttext = [dic.doc2bow(tokens) for tokens in ttext]
ttext = lda_model.get_document_topics(ttext)

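However, the dictionary that lda_model was trained with might differ from the one built from the new data, and the last line then raises an error such as: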

"IndexError: index 41021 is out of bounds for axis 1 with size 41021"

Is there any way (or parameter) to get the dictionary from the trained lda_model, so that it can be used instead of dic = Dictionary(ttext)? Thanks a lot for your help and answers!

Answer 1

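The general approach should be to store the dictionary created while training the model to a file using the Dictionary.save method, and to read it back for reuse using Dictionary.load.

Only then does Dictionary.token2id remain the same, so that it can be used to map ids to words and vice versa for a pretrained model.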

Tags: python, dictionary, lda, gensim, topic-modeling