Gensim LDA Coherence Score NaN

Question

I created a Gensim LDA model as shown in this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

import gensim

lda_model = gensim.models.LdaMulticore(data_df['bow_corpus'], num_topics=10, id2word=dictionary, random_state=100, chunksize=100, passes=10, per_word_topics=True)

It generates 10 topics, with a log_perplexity of:

lda_model.log_perplexity(data_df['bow_corpus']) = -5.325966117835991

But when I run the coherence model on top of it to compute a coherence score, like this:

import numpy as np
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_df['bow_corpus'].tolist(), dictionary=dictionary, coherence='c_v')
with np.errstate(invalid='ignore'):
    lda_score = coherence_model_lda.get_coherence()

My LDA coherence score comes out as NaN. What am I doing wrong here?

Answer 1

Solved! The coherence model needs the original tokenized texts, not the bag-of-words corpus that was used to train the LDA model. So when I ran this:

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_df['corpus'].tolist(), dictionary=dictionary, coherence='c_v')
with np.errstate(invalid='ignore'):
    lda_score = coherence_model_lda.get_coherence()

I got a coherence score of: 0.462

Hope this helps anyone else who makes the same mistake. Thanks!
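
For anyone unsure which of the two columns they are holding, here is a minimal sketch of the difference between tokenized texts and a bag-of-words corpus. It uses only the standard library: the toy `texts`, the hand-built `token2id` mapping, and the helpers `to_bow` and `is_tokenized` are all hypothetical stand-ins for illustration, not Gensim functions. `to_bow` only mimics the shape of a bag-of-words vector, a list of (token_id, count) pairs.

```python
from collections import Counter

# Hypothetical toy data standing in for data_df['corpus'] (tokenized texts).
texts = [
    ["topic", "model", "coherence"],
    ["topic", "model", "perplexity"],
]

# Hand-rolled stand-in for a dictionary: token -> integer id.
token2id = {}
for doc in texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Mimic the shape of a bag-of-words vector: sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# Stand-in for data_df['bow_corpus']: the words themselves are gone, only ids remain.
bow_corpus = [to_bow(doc) for doc in texts]


def is_tokenized(docs):
    """Return True only if every document is a list of str tokens."""
    return all(
        isinstance(doc, list) and all(isinstance(tok, str) for tok in doc)
        for doc in docs
    )


print(bow_corpus[0])             # list of (id, count) tuples, no strings
print(is_tokenized(texts))       # the shape the texts argument expects
print(is_tokenized(bow_corpus))  # the shape that was passed by mistake
```

A check like `is_tokenized` before building the CoherenceModel would have flagged the mix-up in the question, since the bag-of-words documents contain tuples rather than strings.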

Answer 2

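The documentation (https://radimrehurek.com/gensim/models/coherencemodel.html) says to provide "tokenized texts" (a list of lists of str). These should be your texts, split into the individual words that appear in the dictionary you pass to CoherenceModel. If you pass untokenized full text instead, the dictionary lookup finds no entries for it.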
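
A rough way to see this effect is to measure how much of the input can actually be found in the vocabulary. The sketch below is standard-library only and is not part of Gensim: `vocabulary_coverage` is a hypothetical helper, and the plain `vocabulary` set is an assumed stand-in for the tokens known to your dictionary (with a Gensim `Dictionary` you could take the keys of `dictionary.token2id`).

```python
def vocabulary_coverage(docs, vocabulary):
    """Fraction of items across all docs that appear in the vocabulary."""
    total = 0
    found = 0
    for doc in docs:
        for item in doc:  # iterating over a raw str yields single characters
            total += 1
            found += item in vocabulary
    return found / total if total else 0.0


# Hypothetical vocabulary and inputs for illustration.
vocabulary = {"topic", "model", "coherence"}
tokenized = [["topic", "model"], ["coherence", "score"]]
untokenized = ["topic model", "coherence score"]

print(vocabulary_coverage(tokenized, vocabulary))    # 3 of 4 tokens are known
print(vocabulary_coverage(untokenized, vocabulary))  # characters never match words
```

With untokenized strings, each "token" seen is a single character, so nothing matches the vocabulary and there is nothing for the coherence measure to count.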

python

machine-learning

lda

gensim

topic-modeling