How to define the optimal number of topics (k)?

Question

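I want to find the optimal number of topics (k) to pass to gensim for LDA, and I found an answer on Stack Overflow. However, I get the error shown below.

This is the link to the suggested method I found for choosing the optimal number of topics: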

What is the best way to obtain the optimal number of topics for a LDA-Model using Gensim?

# import modules 

import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

# make models with n k

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

# Considering 1-15 topics, as the last is cut off
num_topics = list(range(16)[1:])
num_keywords = 15

LDA_models = {}
LDA_topics = {}
for i in num_topics:
    LDA_models[i] = LdaModel(corpus=bow_corpus,
                             id2word=dirichlet_dict,
                             num_topics=i,
                             update_every=1,
                             chunksize=len(bow_corpus),
                             passes=20,
                             alpha='auto',
                             random_state=42)

    shown_topics = LDA_models[i].show_topics(num_topics=num_topics, 
                                             num_words=num_keywords,
                                             formatted=False)
    LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]

When I try to run the code, I get this error:

-> 1145         if num_topics < 0 or num_topics >= self.num_topics:
   1146             num_topics = self.num_topics
   1147             chosen_topics = range(num_topics)

TypeError: '<' not supported between instances of 'list' and 'int'

Answer 1

This line:

shown_topics = LDA_models[i].show_topics(num_topics=num_topics

should be:

shown_topics = LDA_models[i].show_topics(num_topics=i
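
To see why the original call fails, here is a minimal sketch that copies the check from the traceback into a stand-in function. Note that resolve_num_topics and candidate_counts are hypothetical names made up for this illustration; they are not part of gensim.

```python
def resolve_num_topics(num_topics, model_num_topics):
    # Same comparison as the line shown in the traceback.
    if num_topics < 0 or num_topics >= model_num_topics:
        num_topics = model_num_topics
    return num_topics


# This is the list that the question stores under the name num_topics.
candidate_counts = list(range(1, 16))

error_message = None
try:
    # Passing the whole list reproduces the failure: a list cannot be
    # compared with an int.
    resolve_num_topics(candidate_counts, 5)
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Passing a single integer, as in the corrected call, works.
print(resolve_num_topics(3, 5))
```

The first print shows the same TypeError message as in the question, and the second call returns normally because an integer is being compared with an integer.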

Arguably, the error comes down to poor variable naming. You can avoid it by replacing num_topics = list(range(16)[1:]) and the subsequent loop with:

max_topics = 15
for num_topics in range(1, max_topics+1):
    # use num_topics instead of i in the loop

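This removes the possible confusion, because num_topics is then always the integer topic count for the current iteration rather than the list of candidates.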
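
As a rough sketch, the renamed loop could be wrapped in a small helper like the one below. The helper collect_topic_keywords and its train_model argument are assumptions introduced here for illustration: train_model is any callable that takes an integer topic count and returns a trained model with a show_topics method, such as a gensim LdaModel.

```python
def collect_topic_keywords(train_model, max_topics=15, num_keywords=15):
    """Train one model per topic count and collect each topic's top words.

    train_model is a hypothetical callable: it takes an integer topic count
    and returns a trained model that exposes show_topics.
    """
    models = {}
    topics = {}
    for num_topics in range(1, max_topics + 1):
        # num_topics is always an int here, so it can be passed on directly.
        models[num_topics] = train_model(num_topics)
        shown_topics = models[num_topics].show_topics(
            num_topics=num_topics,
            num_words=num_keywords,
            formatted=False,
        )
        # With formatted=False each entry is (topic_id, [(word, weight), ...]).
        topics[num_topics] = [
            [word for word, _ in pairs] for _, pairs in shown_topics
        ]
    return models, topics
```

With the objects from the question, train_model could be something like lambda k: LdaModel(corpus=bow_corpus, id2word=dirichlet_dict, num_topics=k, passes=20, alpha='auto', random_state=42), and the two returned dictionaries play the role of LDA_models and LDA_topics.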

