Get most probable words for each topic
I built an LDA model with sklearn, but, strange as it sounds, I can't find anything online about how to get the top words for each topic. Here is my code:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(tweet_tp['text'].values.astype('U'))
doc_term_matrix
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=3, random_state=1)
id_topic = LDA.fit(doc_term_matrix)
Then I added this:
import numpy as np
vocab = count_vect.get_feature_names()
topic_words = {}
for topic, comp in enumerate(LDA.components_):
    word_idx = np.argsort(comp)[::-1][:5]
    topic_words[topic] = [vocab[i] for i in word_idx]

for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))
I found this in an answer on here, though I can't locate it now. However, it only prints the words for the second topic.
You can use ntopwlst like this:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(tweet_tp['text'].values.astype('U'))
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=3, random_state=1)
id_topic = LDA.fit(doc_term_matrix)
def ntopwlst(model, features, ntopwords):
    '''create a list of the top topic words'''
    output = []
    for topic_idx, topic in enumerate(model.components_):  # compose output with top words per topic
        output.append(str(topic_idx))
        output += [features[i] for i in topic.argsort()[:-ntopwords - 1:-1]]  # [start (0 if omitted) : end : step]
    return output
ntopwords = 5 # change this to show more words for the topic selector (20)
tf_feature_names = count_vect.get_feature_names()
topwds = ntopwlst(LDA, tf_feature_names, ntopwords)
You did extract the vocabulary, but this is easier than working through the LDA results directly. I couldn't test this since I don't have the tweet_tp data, so handle with care.