Finding the number of documents per topic for LDA with scikit-learn

Question

I am following the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here, but I don't see where I could get this number. Has anyone been able to do this before with scikit-learn?

Answer 1

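LDA computes a list of topic probabilities for each document, so you may want to interpret a document's topic as the topic that has the highest probability for that document.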
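As a minimal sketch of that interpretation, assume a small hand-written document-topic matrix (the numbers are hypothetical, not output from a fitted model). Each row is a document, each column a topic, and the dominant topic is the column with the largest value in the row:

```python
import numpy as np

# Hypothetical topic probabilities for 4 documents and 3 topics.
# Each row sums to 1, as a document-topic distribution would.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Index of the highest-probability topic for each document.
dominant = doc_topic.argmax(axis=1)

# Number of documents assigned to each topic index;
# minlength keeps topics that no document was assigned to.
counts = np.bincount(dominant, minlength=doc_topic.shape[1])

print(dominant)  # [0 1 0 2]
print(counts)    # [2 1 1]
```

Note that this hard assignment discards the mixture information: a document split 0.51/0.49 between two topics is counted only under the first.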

If dtm is your document-term matrix and lda is your Latent Dirichlet Allocation object, you can explore the topic mixtures using the transform() function and pandas:

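import pandas as pd

# Rows are documents, columns are topic probabilities.
# N_TOPICS is the number of topics the model was fit with.
docsVStopics = lda.transform(dtm)
docsVStopics = pd.DataFrame(docsVStopics, columns=["Topic"+str(i+1) for i in range(N_TOPICS)])
print("Created a (%dx%d) document-topic matrix." % (docsVStopics.shape[0], docsVStopics.shape[1]))
docsVStopics.head()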

You can easily find the most likely topic for each document:

most_likely_topics = docsVStopics.idxmax(axis=1)

Then get the counts:

most_likely_topics.groupby(most_likely_topics).count()
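Putting the steps together, here is a self-contained sketch that can be run as is. The toy corpus, the choice of two topics, and the snake_case variable names are assumptions made for illustration; they are not part of the original answer. It uses value_counts() in place of the groupby, which gives the same counts, and reindex() so that topics with no documents show up as 0:

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
    "pets need food",
    "bonds pay interest",
]
N_TOPICS = 2  # assumed number of topics

# Build the document-term matrix and fit the model.
dtm = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda.fit(dtm)

# Document-topic matrix: one row per document, one column per topic.
topic_names = ["Topic" + str(i + 1) for i in range(N_TOPICS)]
docs_vs_topics = pd.DataFrame(lda.transform(dtm), columns=topic_names)

# Most likely topic per document, then the number of documents per topic.
most_likely_topics = docs_vs_topics.idxmax(axis=1)
counts = most_likely_topics.value_counts().reindex(topic_names, fill_value=0)
print(counts)
```

The actual split between topics depends on the data and the random seed, so no particular output is implied here; the counts always sum to the number of documents.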

lda

scikit-learn