Python, LDA: How to get the id of keywords instead of the keywords themselves with Gensim?
I am using Gensim to apply the LDA method and extract keywords from documents. I can extract the topics and then associate those topics and their keywords with the documents.

What I want is the id of these terms (or keywords) instead of the terms themselves. I know that corpus[i] gives the list of (term_id, term_frequency) pairs for document i, but I can't see how to extract only the ids in my code and assign them to my result.
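For reference, here is a minimal sketch of the corpus / dictionary shapes I am talking about (the two toy documents are made up purely for illustration):

from gensim import corpora

# Toy tokenized documents, only to show the shapes involved
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system']]

dictionary = corpora.Dictionary(texts)                 # term <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words per document

print(corpus[1])            # [(term_id, term_frequency), ...] for document 1
print(dictionary[0])        # the term stored under id 0
print(dictionary.token2id)  # e.g. {'computer': 0, 'human': 1, 'interface': 2, ...}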
My code is below:
from itertools import chain

import gensim
import pandas as pd

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=passes, minimum_probability=0)

# Assigning the topics to the documents in the corpus
lda_corpus = ldamodel[corpus]

# Find the threshold; let's set the threshold to be 1/#clusters.
# To check that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in lda_corpus]))
threshold = sum(scores) / len(scores)
print(threshold)

# For each topic, collect its top keywords and the documents assigned to it
for t in range(len(topic_tuple)):
    key_words.append([topic_tuple[t][j][0] for j in range(num_words)])
    df_key_words = pd.DataFrame({'key_words': key_words})

    documents_corpus.append([j for i, j in zip(lda_corpus, doc_set) if i[t][1] > threshold])
    df_documents_corpus = pd.DataFrame({'documents_corpus': documents_corpus})

    documents_corpus_id.append([i for d, i in zip(lda_corpus, doc_set_id) if d[t][1] > threshold])
    df_documents_corpus_id = pd.DataFrame({'documents_corpus_id': documents_corpus_id})

    result.append(pd.concat([df_key_words, df_documents_corpus, df_documents_corpus_id], axis=1))
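To make the goal concrete: topic_tuple in the code above holds (word, probability) pairs per topic, so topic_tuple[t][j][0] is a keyword string, and I would like the corresponding term ids instead. Two hedged sketches of what I mean, not verified against my data (get_topic_terms is the gensim LdaModel method that, as I understand it, returns (word_id, probability) pairs directly):

# Route 1: ask the model for ids directly
key_word_ids = [[word_id for word_id, _prob in ldamodel.get_topic_terms(t, topn=num_words)]
                for t in range(num_topics)]

# Route 2: map the keyword strings from topic_tuple back to ids via the dictionary
key_word_ids = [[dictionary.token2id[word] for word, _prob in topic_tuple[t][:num_words]]
                for t in range(len(topic_tuple))]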
Thanks in advance, and let me know if you need more information.
In case anyone runs into the same problem as me, here is the answer, a reverse mapping:
reverse_map = dict((ldamodel.id2word[id],id) for id in ldamodel.id2word)
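For example, to turn the keyword lists collected above into id lists (assuming every keyword is in the model's vocabulary, which it should be since the words come from the model itself):

key_word_ids = [[reverse_map[word] for word in words] for words in key_words]

Note that gensim's Dictionary already exposes the same mapping as dictionary.token2id, so reverse_map[word] and dictionary.token2id[word] should agree.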
Thanks to bigdeeperadvisors