Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

Question

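I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations in here, but I couldn't find how to use the trained model to find the topic distribution of a new, unseen document.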

Answer 1

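As of Spark 1.5, this functionality has not been implemented for DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, like this: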

val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
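
The `...` above has to yield term-count vectors indexed by the same vocabulary that was used for training, otherwise the topic-word weights will not line up. A minimal sketch of that step follows; the `vocab` map, the `NewDocVectors` object and both helper names are hypothetical and only illustrate the idea, they are not part of MLlib:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object NewDocVectors {
  // Hypothetical vocabulary: must be the same term -> index map used at training time.
  val vocab: Map[String, Int] = Map("spark" -> 0, "topic" -> 1, "model" -> 2, "lda" -> 3)

  // Counts in-vocabulary terms; tokens missing from the vocabulary are dropped.
  def toCountVector(tokens: Seq[String]): Vector = {
    val counts = tokens
      .flatMap(vocab.get)
      .groupBy(identity)
      .map { case (idx, hits) => (idx, hits.size.toDouble) }
      .toSeq
    Vectors.sparse(vocab.size, counts)
  }

  // Pairs each tokenized document with a Long id, as topicDistributions expects.
  def toDocuments(tokenized: RDD[Seq[String]]): RDD[(Long, Vector)] =
    tokenized.zipWithIndex.map { case (tokens, id) => (id, toCountVector(tokens)) }
}
```

With something like this in place, `NewDocVectors.toDocuments(tokenizedNewDocs)` could stand in for `newDocuments`, where `tokenizedNewDocs` is whatever tokenized input you already have.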

Converting with toLocal will be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm, which already results in a LocalLDAModel. In addition to being faster, this new algorithm is also preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach et al., optimization of the alphas is very important for obtaining good topics.
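
The online route could look roughly like the sketch below. The toy data, the `OnlineLdaSketch` object, the local master and the choice of `k = 2` are assumptions made for illustration. Whether alpha optimization is on by default may depend on your Spark version, so check the `OnlineLDAOptimizer` API docs for the relevant setter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object OnlineLdaSketch {
  // Trains on toy term counts with the online optimizer and returns the
  // topic distributions of documents that were not part of training.
  def infer(sc: SparkContext): Array[(Long, Vector)] = {
    val training: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (0L, Vectors.dense(3.0, 2.0, 0.0, 0.0)),
      (1L, Vectors.dense(2.0, 3.0, 1.0, 0.0)),
      (2L, Vectors.dense(0.0, 0.0, 3.0, 2.0)),
      (3L, Vectors.dense(0.0, 1.0, 2.0, 3.0))
    ))

    // The corpus is tiny, so every mini-batch uses all documents.
    val optimizer = new OnlineLDAOptimizer().setMiniBatchFraction(1.0)

    val model = new LDA()
      .setK(2)
      .setMaxIterations(20)
      .setOptimizer(optimizer)
      .run(training)

    model match {
      case local: LocalLDAModel =>
        val newDocuments: RDD[(Long, Vector)] = sc.parallelize(Seq(
          (100L, Vectors.dense(1.0, 2.0, 0.0, 0.0)),
          (101L, Vectors.dense(0.0, 0.0, 2.0, 1.0))
        ))
        local.topicDistributions(newDocuments).collect()
      case other =>
        sys.error(s"Expected a LocalLDAModel, got ${other.getClass.getName}")
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OnlineLdaSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try infer(sc).foreach { case (id, dist) => println(s"$id -> $dist") }
    finally sc.stop()
  }
}
```

No toLocal call is needed here because the online optimizer produces a local model directly; the pattern match only guards against a different model type being returned.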

lda

topic-modeling

apache-spark

apache-spark-mllib