How can I print document-wise topics in Gensim?

Question

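I am using LDA with gensim for topic modeling. My data has 23 documents, and I want separate topics/words for each document, but gensim gives me topics for the whole document set together. How do I get them for individual documents?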

import gensim
from gensim import corpora

# doc_clean is the list of cleaned, tokenized documents (one list of tokens per document).
dictionary = corpora.Dictionary(doc_clean)

# Convert the list of documents (corpus) into a document-term matrix
# using the dictionary prepared above.
corpus = [dictionary.doc2bow(doc) for doc in doc_clean]

# Create the LDA model class from the gensim library.
Lda = gensim.models.ldamodel.LdaModel

# Train the LDA model on the document-term matrix.
ldamodel = Lda(corpus, num_topics=3, id2word=dictionary, passes=50)

# Get the top 3 words for each of the 3 topics.
result = ldamodel.print_topics(num_topics=3, num_words=3)

This is the output I get:

[(0, '0.011*"plex" + 0.010*"game" + 0.009*"racing"'),
(1, '0.008*"app" + 0.008*"live" + 0.007*"share"'),
(2, '0.015*"device" + 0.009*"file" + 0.008*"movie"')]

Answer 1

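print_topics() returns a list of topics, the words that load onto each topic, and the weights of those words.

If you want the topic loadings for each document, you need to use get_document_topics() instead.

From the gensim documentation: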

get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)

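Get the topic distribution for the given document.

Parameters:

bow (corpus : list of (int, float)) – The document in BOW format.

minimum_probability (float) – Topics with an assigned probability lower than this threshold will be discarded.

minimum_phi_value (float) – If per_word_topics is True, this represents a lower bound on the term probabilities that are included. If set to None, a value of 1e-8 is used to prevent 0s.

per_word_topics (bool) – If True, this function will also return two extra lists, as explained in the "Returns" section.

Returns:

list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic's id and the probability that was assigned to it.

list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word's id and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.

list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word's id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.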
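
As a rough illustration of the per-document lookup described above, the sketch below calls get_document_topics() once per bag-of-words document and picks the most probable topic for each. The helper names (document_topic_rows, dominant_topic, demo), the toy doc_clean data, and the choices num_topics=2 and random_state=0 are assumptions made for this example and do not come from the question.

```python
def document_topic_rows(model, corpus, minimum_probability=0.0):
    """Return one list of (topic_id, probability) pairs per document in corpus."""
    rows = []
    for bow in corpus:
        doc_topics = model.get_document_topics(bow, minimum_probability=minimum_probability)
        # Convert numpy scalars to plain Python numbers for readable printing.
        rows.append([(int(topic_id), float(prob)) for topic_id, prob in doc_topics])
    return rows


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


def demo():
    """Train a tiny LDA model on made-up data and print the topics of each document."""
    # Imported here so the two helpers above work with any already trained model.
    from gensim import corpora
    from gensim.models import LdaModel

    # Toy stand-in for the question's doc_clean (an assumption, not the asker's data).
    doc_clean = [
        ["device", "file", "movie", "device"],
        ["game", "racing", "plex", "game"],
        ["app", "live", "share", "movie"],
    ]
    dictionary = corpora.Dictionary(doc_clean)
    corpus = [dictionary.doc2bow(doc) for doc in doc_clean]
    model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

    for doc_id, doc_topics in enumerate(document_topic_rows(model, corpus)):
        topic_id, prob = dominant_topic(doc_topics)
        # show_topic() returns (word, probability) pairs for one topic.
        top_words = [word for word, _ in model.show_topic(topic_id, topn=3)]
        print(f"document {doc_id}: {doc_topics}")
        print(f"  dominant topic {topic_id} (p={prob:.3f}): {top_words}")
```

With the model from the question, document_topic_rows(ldamodel, corpus) should give 23 rows, one per document. Passing minimum_probability=0.0 is meant to keep low-probability topics in each row; calling demo() requires gensim to be installed.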

get_term_topics() and get_topic_terms() may also be of interest to you.
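
To give a feel for those two methods, here is a minimal sketch that wraps each one. It assumes an already trained LdaModel (such as ldamodel from the question) and the Dictionary it was built with; the wrapper names topic_top_words and topics_for_word are hypothetical and only exist for this example.

```python
def topic_top_words(model, topic_id, topn=5):
    """Return (word, probability) pairs for one topic.

    get_topic_terms() yields word ids, so they are mapped back to
    words through the model's id2word mapping.
    """
    return [
        (model.id2word[word_id], float(prob))
        for word_id, prob in model.get_topic_terms(topic_id, topn=topn)
    ]


def topics_for_word(model, dictionary, word, minimum_probability=0.0):
    """Return (topic_id, probability) pairs for the topics relevant to one word."""
    word_id = dictionary.token2id[word]  # raises KeyError for unknown words
    return [
        (int(topic_id), float(prob))
        for topic_id, prob in model.get_term_topics(word_id, minimum_probability=minimum_probability)
    ]
```

For example, topic_top_words(ldamodel, 0, topn=3) would list the three strongest words of topic 0, and topics_for_word(ldamodel, dictionary, "device") would list the topics in which "device" carries weight, assuming that word is in the dictionary.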

Answer 2

If I understood you correctly, you need to put the whole thing inside a loop and call print_topics():

Your example documents:

doc1 = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc2 = "My mother spends a lot of time driving my brother around to baseball practice."
doc3 = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_set = [doc1, doc2, doc3]

Now your loop must iterate over your doc_set:

import gensim
from gensim import corpora

doc_clean = []  # cleaned, tokenized documents collected so far

for doc in doc_set:
    ##### after all the cleaning in these steps, append to a list #####
    # (the cleaning code is not shown; it is assumed to append the tokens of doc to doc_clean)

    dictionary = corpora.Dictionary(doc_clean)
    corpus = [dictionary.doc2bow(d) for d in doc_clean]

    ##### set the num_topics you want for each document, I set one for now #####

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=20)
    for topic in ldamodel.print_topics():
        print(topic)
        print('\n')

Sample output:

(0, '0.200*"brocolli" + 0.200*"eat" + 0.200*"good" + 0.133*"brother" + 0.133*"like" + 0.133*"mother"')


(0, '0.097*"brocolli" + 0.097*"eat" + 0.097*"good" + 0.097*"mother" + 0.097*"brother" + 0.065*"lot" + 0.065*"spend" + 0.065*"practic" + 0.065*"around" + 0.065*"basebal"')


(0, '0.060*"drive" + 0.060*"eat" + 0.060*"good" + 0.060*"mother" + 0.060*"brocolli" + 0.060*"brother" + 0.040*"pressur" + 0.040*"health" + 0.040*"caus" + 0.040*"increas"')

