How to define the optimal number of topics (k)?
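I want to find the optimal number of topics (k) to pass to gensim's LDA, and I found an answer on Stack Overflow. However, I get the error shown below.
Here is the link to the suggested approach I found for choosing the optimal number of topics: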
What is the best way to obtain the optimal number of topics for a LDA-Model using Gensim?
# import modules
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

# make models with n k
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

# Considering 1-15 topics, as the last is cut off
num_topics = list(range(16)[1:])
num_keywords = 15

LDA_models = {}
LDA_topics = {}
for i in num_topics:
    LDA_models[i] = LdaModel(corpus=bow_corpus,
                             id2word=dirichlet_dict,
                             num_topics=i,
                             update_every=1,
                             chunksize=len(bow_corpus),
                             passes=20,
                             alpha='auto',
                             random_state=42)

    shown_topics = LDA_models[i].show_topics(num_topics=num_topics,
                                             num_words=num_keywords,
                                             formatted=False)
    LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]
When I try to run the code, I get this error:
-> 1145 if num_topics < 0 or num_topics >= self.num_topics:
1146 num_topics = self.num_topics
1147 chosen_topics = range(num_topics)
TypeError: '<' not supported between instances of 'list' and 'int'
This line:
shown_topics = LDA_models[i].show_topics(num_topics=num_topics
should be:
shown_topics = LDA_models[i].show_topics(num_topics=i
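To see why passing the list triggers this error, here is a minimal sketch that reproduces the comparison from the traceback without needing gensim. `FakeLda` is a hypothetical stand-in written only for illustration, not gensim's class; it copies just the bounds check shown at line 1145 of the traceback.

```python
class FakeLda:
    """Hypothetical stand-in that mimics only the bounds check from the traceback."""

    def __init__(self, num_topics):
        self.num_topics = num_topics

    def show_topics(self, num_topics=10):
        # Same comparison as in the traceback: it only works if num_topics is a number.
        if num_topics < 0 or num_topics >= self.num_topics:
            num_topics = self.num_topics
        return list(range(num_topics))


# In the question, num_topics is the list of candidate k values, not a single k.
num_topics = list(range(16)[1:])
model = FakeLda(num_topics=3)

try:
    model.show_topics(num_topics=num_topics)  # list compared with int
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Passing the integer for the current model works as intended.
chosen = model.show_topics(num_topics=3)
print(chosen)
```

The list is compared against `0` before anything else happens, so the call fails regardless of how the model was trained.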
Arguably, this comes from poor variable naming. You can avoid it by replacing num_topics = list(range(16)[1:]) and the loop that follows with:
max_topics = 15
for num_topics in range(1, max_topics+1):
    # use num_topics instead of i in the loop
This removes the possible confusion.