从 Pyspark LDA 模型中提取文档主题矩阵

Question

我已经通过 Python API:

在 spark 中成功训练了一个 LDA 模型

from pyspark.mllib.clustering import LDA
model=LDA.train(corpus,k=10)

这完全没问题，但我现在需要 document-LDA 模型的主题矩阵，但据我所知，我能得到的只是 word-topic, using model.topicsMatrix().

有没有什么方法可以从 LDA 模型中获取文档-主题矩阵，如果没有，Spark 中是否有替代方法（除了从头开始实现 LDA）到运行 LDA 模型会给我需要的结果吗？

编辑：

仔细研究后，我在 Java api 中找到了 DistributedLDAModel 的文档，其中有一个 topicDistributions() 我认为正是我需要的在这里（但我 100% 确定 Pyspark 中的 LDAModel 实际上是底层的 DistributedLDAModel...）。

无论如何，我可以像这样间接调用这个方法，没有任何明显的失败：

In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480

但如果我真的查看结果，我得到的只是字符串，告诉我结果实际上是一个 Scala 元组（我认为）：

In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'}]

也许这通常是正确的方法，但有没有办法获得实际结果？

Answer 1

经过广泛研究，这在当前版本的 Spark (1.5.1) 上绝对不可能通过 Python api 实现。但在 Scala 中，它相当简单（给定一个要训练的 RDD documents）：

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}

// first generate RDD of documents...

val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)

# then convert to distributed LDA model
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]

然后获取文档主题分布很简单：

distLDAModel.topicDistributions

Answer 2

从 Spark 2.0 开始，您可以使用 transform() 作为 pyspark.ml.clustering.DistributedLDAModel 的方法。我刚刚在 scikit-learn 的 20 个新闻组数据集上尝试了这个并且它有效。查看返回的 vectors，这是文档主题的分布。

>>> test_results = ldaModel.transform(wordVecs)
Row(filename='/home/jovyan/work/data/20news_home/20news-bydate-test/rec.autos/103343', target=7, text='I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.', tokens=['little', 'confused', 'models', 'bonnevilles', 'someone', 'differences', 'features', 'performance', 'curious', 'prefereably', 'usually', 'demand', 'spring', 'summer'], vectors=SparseVector(10977, {28: 1.0, 29: 1.0, 152: 1.0, 301: 1.0, 496: 1.0, 552: 1.0, 571: 1.0, 839: 1.0, 1114: 1.0, 1281: 1.0, 1288: 1.0, 1624: 1.0}), topicDistribution=DenseVector([0.0462, 0.0538, 0.045, 0.0473, 0.0545, 0.0487, 0.0529, 0.0535, 0.0467, 0.0549, 0.051, 0.0466, 0.045, 0.0487, 0.0482, 0.0509, 0.054, 0.0472, 0.0547, 0.0501]))

Answer 3

以下扩展了 PySpark 和 Spark 2.0 的上述响应。

请原谅我将此作为回复而不是评论发布，但我现在缺少代表。

我假设你有一个训练有素的 LDA 模型，它是由这样的语料库制成的：

lda = LDA(k=NUM_TOPICS, optimizer="em")
ldaModel = lda.fit(corpus) # Where corpus is a dataframe with 'features'.

为了将文档转换为主题分布，我们创建了文档 ID 的数据框和单词的向量（越稀疏越好）。

documents = spark.createDataFrame([
    [123myNumericId, Vectors.sparse(len(words_in_our_corpus), {index_of_word:count}],
    [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word:count, another:1.0}],
], schema=["id", "features"]
transformed = ldaModel.transform(documents)
dist = transformed.take(1)
# dist[0]['topicDistribution'] is now a dense vector of our topics.

从 Pyspark LDA 模型中提取文档主题矩阵

Extract document-topic matrix from Pyspark LDA Model

python

lda

apache-spark

pyspark