Finding the number of documents per topic for LDA with scikit-learn
I am following the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here, but I can't see where I could get this number. Has anyone done this with scikit-learn before?
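LDA computes a list of topic probabilities for each document, so you may want to treat a document's topic as the one with the highest probability for that document.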
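As a minimal sketch of that idea, independent of scikit-learn: the probabilities below are invented for illustration, and `doc_topic` and `dominant_topic` are hypothetical names rather than anything from the library.

```python
from collections import Counter

# Hypothetical document-topic probabilities: one row per document,
# one column per topic (values invented for illustration).
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.05, 0.15, 0.80],
]

def dominant_topic(probs):
    """Return the index of the highest-probability topic for one document."""
    return max(range(len(probs)), key=probs.__getitem__)

# Assign each document to its most probable topic, then count per topic.
dominant = [dominant_topic(row) for row in doc_topic]
docs_per_topic = Counter(dominant)

print(dominant)
print(dict(docs_per_topic))
```

Note that this is a hard assignment: a document with probabilities 0.51 and 0.49 counts entirely toward the first topic, so the counts hide how mixed each document is.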
If dtm is your document-term matrix and lda is your Latent Dirichlet Allocation object, you can explore the topic mixtures using the transform() function and pandas:
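import pandas as pd

# Rows are documents, columns are topics.
docsVStopics = lda.transform(dtm)
# N_TOPICS is the number of topics the model was fitted with.
docsVStopics = pd.DataFrame(docsVStopics, columns=["Topic"+str(i+1) for i in range(N_TOPICS)])
print("Created a (%dx%d) document-topic matrix." % (docsVStopics.shape[0], docsVStopics.shape[1]))
docsVStopics.head()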
You can easily find the most likely topic for each document:
most_likely_topics = docsVStopics.idxmax(axis=1)
Then get the counts:
most_likely_topics.groupby(most_likely_topics).count()
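One thing to watch for: the grouped count only lists topics that were the top topic for at least one document. If you also want topics with zero documents to show up, a possible variant is `value_counts()` followed by `reindex()`. The sketch below uses an invented `most_likely_topics` Series and a hypothetical `all_topics` list in place of the real model output.

```python
import pandas as pd

# Hypothetical result of docsVStopics.idxmax(axis=1): no document was
# assigned to Topic3 (labels invented for illustration).
most_likely_topics = pd.Series(["Topic1", "Topic2", "Topic1", "Topic2", "Topic1"])
all_topics = ["Topic1", "Topic2", "Topic3"]

# value_counts() counts each label; reindex() adds the missing topics as 0.
docs_per_topic = most_likely_topics.value_counts().reindex(all_topics, fill_value=0)
print(docs_per_topic)
```

With the real data, `all_topics` would be `docsVStopics.columns`.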