gensim.interfaces.TransformedCorpus - How to use it?

Question

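I am new to Latent Dirichlet Allocation. I was able to follow the Wikipedia tutorial to generate an LDA model, and I can also generate an LDA model from my own documents. My next step is to understand how to use a previously generated model to classify unseen documents.

I am saving my "lda_wiki_model" with: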

import gensim

# Load the id -> word mapping and the TF-IDF corpus built from the Wikipedia dump
id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')

# Train the LDA model and save it to disk
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('lda_wiki_model.lda')

I am loading the same model with:

new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model

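I have a "new_doc.txt". I convert my document into an id <-> term dictionary and convert the tokenized document into a "document-term matrix".

But when I run new_topics = new_lda[corpus], I get a 'gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50'.

How can I extract the topics from it?

I have already tried: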

lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
corpus_lda = lsa[new_topics]
print(lsa.print_topics(num_topics=1, num_words=7))

and

print(corpus_lda.print_topics(num_topics=1, num_words=7))

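But the topics returned have nothing to do with my new document. Where is my mistake? What am I missing?

If I train a new model using the dictionary and corpus created above, I get the correct topics. My point is: how do I reuse my model? Is it correct to reuse the wiki_model?

Thanks.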

Answer 1

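I ran into the same problem. This code will solve it:

new_topics = new_lda[corpus]

# Iterating over the transformed corpus yields one result per document
for topic in new_topics:
    print(topic)

For each document, this gives you a list of tuples of the form (topic number, probability).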
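If you only need the single most likely topic per document, you can reduce each list of (topic number, probability) tuples to its highest-probability entry. The following is a minimal sketch, not part of gensim: the helper name `dominant_topics` and the sample data are hypothetical, and the function accepts any iterable shaped like the output of iterating `new_lda[corpus]`.

```python
from typing import Iterable, List, Tuple

# One document's result: a list of (topic_id, probability) pairs
TopicDist = List[Tuple[int, float]]


def dominant_topics(transformed: Iterable[TopicDist]) -> List[Tuple[int, float]]:
    """Return the highest-probability (topic_id, probability) pair per document."""
    result = []
    for doc_topics in transformed:
        if not doc_topics:
            # No topic was reported for this document; use a sentinel value
            result.append((-1, 0.0))
            continue
        result.append(max(doc_topics, key=lambda pair: pair[1]))
    return result


# Hypothetical sample data standing in for the rows yielded by new_lda[corpus]
sample = [
    [(3, 0.12), (17, 0.80), (42, 0.08)],
    [(5, 0.55), (9, 0.45)],
]
best = dominant_topics(sample)
print(best)  # [(17, 0.8), (5, 0.55)]
```

With the real model this would presumably be called as `dominant_topics(new_lda[corpus])`, and the words behind a returned topic number can then be inspected with `new_lda.show_topic(topic_id)`.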

Answer 2

From the "Topics_and_Transformation.ipynb" tutorial prepared by the RaRe Technologies team:

Converting the entire corpus at the time of calling corpus_transformed = model[corpus] would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence.

If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
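To make the quoted behaviour concrete, here is a toy stand-in for a lazily transformed corpus. It is only an illustration of the idea and not gensim's actual implementation: the class `LazyTransformedCorpus`, the `normalize` transformation, and the sample corpus are all made up for this sketch. It shows that nothing is computed when the wrapper is created, and that every pass over it repeats the work.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A document as a sparse vector: list of (id, weight) pairs
Bow = List[Tuple[int, float]]


class LazyTransformedCorpus:
    """Toy wrapper: applies `transform` to each document only when iterated."""

    def __init__(self, transform: Callable[[Bow], Bow], corpus: Iterable[Bow]) -> None:
        self.transform = transform
        self.corpus = corpus
        self.calls = 0  # number of documents transformed so far

    def __iter__(self) -> Iterator[Bow]:
        for doc in self.corpus:
            self.calls += 1
            yield self.transform(doc)


def normalize(doc: Bow) -> Bow:
    """Hypothetical transformation: scale the weights so they sum to 1."""
    total = sum(weight for _, weight in doc)
    return [(term_id, weight / total) for term_id, weight in doc]


corpus = [[(0, 1.0), (1, 3.0)], [(2, 2.0), (3, 2.0)]]
lazy = LazyTransformedCorpus(normalize, corpus)

calls_before = lazy.calls  # still 0: creating the wrapper computes nothing
first_pass = list(lazy)    # documents are transformed here, one at a time
second_pass = list(lazy)   # and transformed again on the second pass
print(calls_before, lazy.calls)
```

This is why printing the result of `model[corpus]` shows only an object, and why a costly transformation that is read several times is better written to disk once. In gensim that would likely be done with `gensim.corpora.MmCorpus.serialize(fname, corpus_transformed)` and reloaded with `gensim.corpora.MmCorpus(fname)`, where `fname` is a file path of your choosing.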

Hope this helps.

Answer 3

This has already been answered, but here is some code for anyone who wants to export the classification of unseen documents to a CSV file.

import pandas as pd

# Build the bag-of-words corpus for the unseen documents
corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

# Transform into LDA space using the previously trained model
lda_unseen = lda_model[corpus_test]

# Print the topic distribution of each document
for topic in lda_unseen:
    print(topic)

# Collect the distributions so they can be exported to CSV
topic_probability = []
for t in lda_unseen:
    topic_probability.append(t)

# The number of column names must match the number of topics returned per document
results_test = pd.DataFrame(topic_probability, columns=['Topic 1', 'Topic 2',
                                                        'Topic 3', 'Topic 4',
                                                        'Topic 5', 'Topic n'])

results_test.to_csv('test_results.csv', index=True, header=True)

The code was inspired by this.
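One caveat with building a table directly from the per-document results: each document yields a sparse list of (topic number, probability) pairs, and as far as I know gensim omits topics whose probability falls below the model's `minimum_probability`, so the lists can differ in length. A possible workaround is to expand each list into a fixed-width row first. This is a minimal sketch using only the standard library; the helper names `to_dense_rows` and `write_topic_csv`, the column names, and the sample data are hypothetical.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

TopicDist = List[Tuple[int, float]]


def to_dense_rows(transformed: Iterable[TopicDist], num_topics: int) -> List[List[float]]:
    """Expand sparse (topic_id, probability) lists into rows of length num_topics."""
    rows = []
    for doc_topics in transformed:
        row = [0.0] * num_topics  # topics that were not reported stay at 0.0
        for topic_id, prob in doc_topics:
            row[topic_id] = float(prob)
        rows.append(row)
    return rows


def write_topic_csv(rows: List[List[float]], num_topics: int, out: TextIO) -> None:
    """Write one line per document with one column per topic."""
    writer = csv.writer(out)
    writer.writerow(["doc"] + [f"topic_{k}" for k in range(num_topics)])
    for doc_index, row in enumerate(rows):
        writer.writerow([doc_index] + row)


# Hypothetical sample data standing in for the rows yielded by lda_model[corpus_test]
sample = [[(0, 0.7), (2, 0.3)], [(1, 1.0)]]
rows = to_dense_rows(sample, num_topics=3)

buffer = io.StringIO()  # replace with open('test_results.csv', 'w', newline='') for a real file
write_topic_csv(rows, 3, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With the real model, the call would presumably look like `to_dense_rows(lda_unseen, lda_model.num_topics)`.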
