How to get a complete topic distribution for a document using gensim LDA?
When I train my LDA model like this:
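import multiprocessing
from gensim import corpora
from gensim.models import LdaMulticore

# data is a list of tokenized documents (a list of lists of strings)
dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)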
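I want to get the full topic distribution over all num_topics for each document. That is, in this particular case, I want each document to have 50 topics contributing to its distribution, and I want to be able to access the contribution of all 50 topics. This is what LDA should output if it adheres strictly to the mathematics of LDA. However, gensim only outputs topics that exceed a certain threshold, as shown here. For example, if I try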
lda[corpus[89]]
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]
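only the 3 topics that contribute most to document 89 are shown. I have tried the solution in the link above, but it does not work for me. I still get the same output: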
theta, _ = lda.inference(corpus)
theta /= theta.sum(axis=1)[:, None]
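This produces the same output, i.e. only 2 or 3 topics per document.
My question is: how do I change this threshold so that I can access the FULL topic distribution for each document? How can I access the full topic distribution, no matter how insignificant a topic's contribution to a document is? The reason I want the full distribution is so that I can perform a KL similarity search between the documents' distributions.
Thanks in advance.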
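It seems nobody has replied yet, so I'll try to answer this as best I can, based on the gensim documentation.
It seems you need to set the parameter minimum_probability to 0.0 when training the model to get the desired result: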
lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=num_cores, alpha=1e-5, eta=5e-1,
minimum_probability=0.0)
lda[corpus[233]]
>>> [(0, 5.8821799358842424e-07),
(1, 5.8821799358842424e-07),
(2, 5.8821799358842424e-07),
(3, 5.8821799358842424e-07),
(4, 5.8821799358842424e-07),
(5, 5.8821799358842424e-07),
(6, 5.8821799358842424e-07),
(7, 5.8821799358842424e-07),
(8, 5.8821799358842424e-07),
(9, 5.8821799358842424e-07),
(10, 5.8821799358842424e-07),
(11, 5.8821799358842424e-07),
(12, 5.8821799358842424e-07),
(13, 5.8821799358842424e-07),
(14, 5.8821799358842424e-07),
(15, 5.8821799358842424e-07),
(16, 5.8821799358842424e-07),
(17, 5.8821799358842424e-07),
(18, 5.8821799358842424e-07),
(19, 5.8821799358842424e-07),
(20, 5.8821799358842424e-07),
(21, 5.8821799358842424e-07),
(22, 5.8821799358842424e-07),
(23, 5.8821799358842424e-07),
(24, 5.8821799358842424e-07),
(25, 5.8821799358842424e-07),
(26, 5.8821799358842424e-07),
(27, 0.99997117731831464),
(28, 5.8821799358842424e-07),
(29, 5.8821799358842424e-07),
(30, 5.8821799358842424e-07),
(31, 5.8821799358842424e-07),
(32, 5.8821799358842424e-07),
(33, 5.8821799358842424e-07),
(34, 5.8821799358842424e-07),
(35, 5.8821799358842424e-07),
(36, 5.8821799358842424e-07),
(37, 5.8821799358842424e-07),
(38, 5.8821799358842424e-07),
(39, 5.8821799358842424e-07),
(40, 5.8821799358842424e-07),
(41, 5.8821799358842424e-07),
(42, 5.8821799358842424e-07),
(43, 5.8821799358842424e-07),
(44, 5.8821799358842424e-07),
(45, 5.8821799358842424e-07),
(46, 5.8821799358842424e-07),
(47, 5.8821799358842424e-07),
(48, 5.8821799358842424e-07),
(49, 5.8821799358842424e-07)]
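The list above is gensim's sparse format: (topic_id, probability) pairs. For numeric work such as distance computations, a fixed-length vector is often easier to handle. Here is a minimal sketch of that conversion. The helper name to_dense is made up for illustration and is not part of gensim. It assumes the pairs come from lda[bow] or get_document_topics, and it fills any topic missing from the list with 0.0, so the result always has num_topics entries.

```python
def to_dense(topic_pairs, num_topics):
    """Convert [(topic_id, prob), ...] into a plain list of length num_topics.

    Hypothetical helper, not a gensim function. Topics that do not appear
    in topic_pairs are assigned probability 0.0.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        dense[topic_id] = float(prob)
    return dense


# The thresholded output for document 89 shown in the question.
sparse_doc = [(2, 0.38951721864890398),
              (9, 0.15438596408262636),
              (37, 0.45607443684895665)]

dense_doc = to_dense(sparse_doc, num_topics=50)
print(len(dense_doc))
```

With minimum_probability=0.0 the list should already be (nearly) complete, but the zero fill keeps the vector length stable if some topics are still missing.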
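In case it helps someone else:
After training your LDA model, if you want to get all topics for a document without a lower-bound threshold, set minimum_probability to 0 when calling the get_document_topics method.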
ldaModel.get_document_topics(bagOfWordOfADocument, minimum_probability=0.0)
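Since the goal in the question is a KL-based similarity search, here is a minimal sketch of comparing two full topic distributions once you have them as equal-length lists of probabilities. The function names and the two toy distributions are hypothetical. The eps smoothing is my own assumption to avoid log(0) and is not something gensim does.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Add a tiny constant and renormalise so zero entries never hit log(0).
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum((a / p_sum) * math.log((a / p_sum) / (b / q_sum))
               for a, b in zip(p_s, q_s))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); KL by itself is not symmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Toy 3-topic distributions standing in for two documents (made-up values).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]

kl_ab = kl_divergence(doc_a, doc_b)
kl_aa = kl_divergence(doc_a, doc_a)
sym_ab = symmetric_kl(doc_a, doc_b)
print(kl_ab, kl_aa, sym_ab)
```

A smaller value means the two documents have more similar topic mixtures. For a search, you would compute this between the query document and every other document and sort by the result.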