How to define the optimal number of topics (k)?

I want to find the optimal number of topics (k) to pass to gensim's LDA, and I found an answer on Stack Overflow. However, I get the error shown below.

This is the question where I found the suggested approach for choosing the optimal number of topics:

What is the best way to obtain the optimal number of topics for a LDA-Model using Gensim?

# import modules 

import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

# make models with n k

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

# Considering 1-15 topics, as the last is cut off
num_topics = list(range(16)[1:])
num_keywords = 15

LDA_models = {}
LDA_topics = {}
for i in num_topics:
    LDA_models[i] = LdaModel(corpus=bow_corpus,
                             id2word=dirichlet_dict,
                             num_topics=i,
                             update_every=1,
                             chunksize=len(bow_corpus),
                             passes=20,
                             alpha='auto',
                             random_state=42)

    shown_topics = LDA_models[i].show_topics(num_topics=num_topics, 
                                             num_words=num_keywords,
                                             formatted=False)
    LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]

When I try to run the code, I get this error:

-> 1145         if num_topics < 0 or num_topics >= self.num_topics:
   1146             num_topics = self.num_topics
   1147             chosen_topics = range(num_topics)

TypeError: '<' not supported between instances of 'list' and 'int'

This line:

shown_topics = LDA_models[i].show_topics(num_topics=num_topics

should be:

shown_topics = LDA_models[i].show_topics(num_topics=i

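Arguably, this comes down to poor variable naming: num_topics holds the whole list of candidate values, while show_topics expects a single integer. You can avoid the problem by replacing num_topics = list(range(16)[1:]) and the subsequent loop with:
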
max_topics = 15
for num_topics in range(1, max_topics+1):
    # use num_topics instead of i in the loop

This removes the potential confusion.
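
To see why the original call fails and the renamed loop does not, here is a minimal sketch that runs without gensim. FakeLda is a hypothetical stand-in, not part of gensim: it copies only the bounds check visible in the traceback and returns topic indices instead of real topics, so it illustrates the type problem and nothing more.

```python
class FakeLda:
    """Hypothetical stand-in for LdaModel (an assumption for illustration).

    It reproduces only the bounds check shown in the traceback.
    """

    def __init__(self, num_topics):
        self.num_topics = num_topics

    def show_topics(self, num_topics=10):
        # Same comparison as in the traceback: it only works for an int.
        if num_topics < 0 or num_topics >= self.num_topics:
            num_topics = self.num_topics
        return list(range(num_topics))


# Original naming: num_topics is the whole list of candidate k values.
num_topics = list(range(16)[1:])
try:
    FakeLda(num_topics=3).show_topics(num_topics=num_topics)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Renamed loop: num_topics is a single int on each iteration.
max_topics = 15
shown = {}
for num_topics in range(1, max_topics + 1):
    model = FakeLda(num_topics=num_topics)
    shown[num_topics] = model.show_topics(num_topics=num_topics)
print(len(shown), shown[3])
```

With the real LdaModel the same reasoning applies: pass the loop variable, which is an int, to show_topics rather than the list of candidates.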