How to compare the topical similarity between two documents in Python Gensim from their topic distributions?

Question

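I trained an LDA model on a corpus using Gensim. Now that I have the topic distribution for each document, how can I compare how topically similar two documents are? I would like a single summary measure. For example, here are the topic distributions for two documents. There are 75 topics in total. For brevity, I show only the 10 topics with the highest probability (so the topic IDs are in no particular order). (40, 0.5523168) means that topic #40 has probability 0.5523168 for DOC #1. Should I compute the Euclidean distance or the cosine distance between the two vectors? And with this summary measure, can I say, for example, that DOC 1 is more similar to DOC 2 than to DOC 3, or that DOC 1 and DOC 2 are more topically similar than DOC 3 and DOC 4? Thanks!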

DOC #1:
[(40, 0.5523168), (60, 0.12225048), (43, 0.07556598), (41, 0.065885976), 
(22, 0.05838573), (24, 0.044774733), (74, 0.019839266), (65, 0.019544959), 
(51, 0.015470431), (36, 0.013449047)]


DOC #2:
[(73, 0.58864516), (41, 0.16827711), (51, 0.09783472), (63, 0.06510383), 
(24, 0.04722658), (32, 0.014467965), (44, 0.012267662), (47, 0.0031533625), 
(18, 0.0022214972), (0, 1.2154361e-05)]

Answer 1

Gensim functionality

Gensim provides the similarities.docsim module, which will "compute similarities across a collection of documents in the Vector Space Model." The Gensim documentation covers similarity queries, and there is also a tutorial on them.

Document similarity measures

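Euclidean distance would be an unusual choice: you can use it, but there are potential problems. You can use cosine similarity instead (Python tutorials for it are available). This takes the cosine of the angle between the two document vectors, which has the advantage of being easily understood. In general it ranges from 1 (the vectors point in the same direction) to -1 (they point in opposite directions), but topic probabilities are never negative, so for topic distributions it ranges from 1 (the documents have the same topic mix) to 0 (the documents share no topics at all). And yes, you can compare the cosine similarity of documents 1 and 2 with that of documents 3 and 4, or calculate the similarity of doc1 to doc2 and of doc1 to doc3 and compare them.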
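As a minimal sketch of the cosine similarity described above, the following computes it directly from the sparse (topic_id, probability) lists shown in the question, using only the standard library. The helper name `cosine_similarity` is made up for illustration. Note that the question lists only the top 10 of 75 topics per document, so the value here is an approximation based on that truncated data.

```python
import math

def cosine_similarity(dist_a, dist_b):
    """Cosine similarity of two sparse (topic_id, probability) lists.

    Topics missing from a list are treated as having probability 0.
    """
    a, b = dict(dist_a), dict(dist_b)
    # Only topics present in both documents contribute to the dot product.
    dot = sum(p * b[t] for t, p in a.items() if t in b)
    norm_a = math.sqrt(sum(p * p for p in a.values()))
    norm_b = math.sqrt(sum(p * p for p in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

# Top-10 topic distributions copied from the question (truncated, not all 75 topics).
doc1 = [(40, 0.5523168), (60, 0.12225048), (43, 0.07556598), (41, 0.065885976),
        (22, 0.05838573), (24, 0.044774733), (74, 0.019839266), (65, 0.019544959),
        (51, 0.015470431), (36, 0.013449047)]
doc2 = [(73, 0.58864516), (41, 0.16827711), (51, 0.09783472), (63, 0.06510383),
        (24, 0.04722658), (32, 0.014467965), (44, 0.012267662), (47, 0.0031533625),
        (18, 0.0022214972), (0, 1.2154361e-05)]

sim_12 = cosine_similarity(doc1, doc2)
print(f"cosine similarity DOC 1 vs DOC 2: {sim_12:.4f}")
```

The two documents overlap only on topics 41, 24, and 51, none of which is dominant in either, so the similarity comes out close to 0. Gensim's `gensim.matutils.cossim` accepts sparse vectors in this same list-of-tuples format, so it should be usable in place of the hand-written helper.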

Even though your question is somewhat different, you may find my answer to a related question over at CrossValidated helpful.

Gensim also has other distance metrics available. Nearly all of them are in gensim's matutils module.

Topic distances

You can also use (some of) the distance metrics linked above to measure the distance between topics, for example the Hellinger distance.
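For reference, the Hellinger distance between two discrete probability distributions can be sketched in a few lines of plain Python. It applies to any pair of distributions: two topics' word distributions, or two documents' topic distributions. The helper name `hellinger_distance` and the toy inputs below are made up for illustration, not taken from the question's data.

```python
import math

def hellinger_distance(dist_a, dist_b):
    """Hellinger distance between two sparse (id, probability) lists.

    Ids missing from a list are treated as having probability 0.
    For distributions that each sum to 1, the result lies in [0, 1]:
    0 means identical, 1 means no overlap at all.
    """
    a, b = dict(dist_a), dict(dist_b)
    ids = set(a) | set(b)
    total = sum(
        (math.sqrt(a.get(i, 0.0)) - math.sqrt(b.get(i, 0.0))) ** 2
        for i in ids
    )
    return math.sqrt(0.5 * total)

# Toy distributions (hypothetical values), each summing to 1.
p = [(0, 0.5), (1, 0.5)]
q = [(0, 0.9), (1, 0.1)]
r = [(2, 1.0)]  # shares no ids with p

print(hellinger_distance(p, p))  # identical distributions
print(hellinger_distance(p, q))  # partially overlapping
print(hellinger_distance(p, r))  # disjoint distributions
```

Unlike cosine similarity, this is a distance, so smaller values mean more similar. Gensim's `gensim.matutils.hellinger` should give the same quantity for inputs in this sparse format.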

python

lda

gensim