Trouble understanding the LDA topic model in MLlib

Question

I'm having trouble understanding the results of the LDA topic model in Spark MLlib.

As I understand it, we should get results like the following:

 Topic 1: term1, term2, term....
 Topic 2: term1, term2, term3...
 ...
 Topic n: term1, ........

 Doc1 : Topic1, Topic2,...
 Doc2 : Topic1, Topic2,...
 Doc3 : Topic1, Topic2,...
 ...
 Docn : Topic1, Topic2,...

I applied LDA to the Spark MLlib sample data, which looks like this:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

I then get the following result:

topics: org.apache.spark.mllib.linalg.Matrix = 

10.33743440804936   9.104197117225599   6.5583684747250395  
6.342536927434482   12.486281081997593  10.171181990567925  
2.1728012328444692  2.1939589470020042  7.633239820153526   
17.858082227094904  9.405347532724434   12.736570240180663  
13.226180094790433  3.9570395921153536  7.816780313094214   
6.155778858763581   10.224730593556806  5.619490547679611   
7.834725138351118   15.52628918346391   7.63898567818497    
4.419396221560405   3.072221927676895   2.5083818507627     
1.4984991123084432  3.5227422247618927  2.978758662929664   
5.696963722524612   7.254625667071781   11.048410610403607  
11.080658179168758  10.11489350657456   11.804448314256682

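Each column is a topic's distribution over terms. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.

I think there are 12 documents, each with a vocabulary of 11 terms. My questions are:

How do I find the topic distribution of each document?
Why does each topic have a distribution over 11 vocabulary terms, when the data contains only 10 distinct values (0-9)?
Why does each column not sum to 100 (which, as I understand it, would mean 100%)?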

Answer 1

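You can get the topic distribution of each document by calling DistributedLDAModel.topicDistributions() or DistributedLDAModel.javaTopicDistributions() in Spark 1.4. This only works if the model's optimizer is set to EMLDAOptimizer (the default).

There is an example here and the documentation here.

In Java it looks like this: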

// With the default EM optimizer, run() returns a DistributedLDAModel, so the cast below is valid.
LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
JavaPairRDD<Long,Vector> topic_dist_over_docs = ((DistributedLDAModel) ldaModel).javaTopicDistributions();
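
Each vector in that result has one weight per topic. To turn it into the "Doc : Topic1, Topic2, ..." listing from the question, the topics can be ordered by weight. Below is a minimal sketch in plain Java with no Spark dependency; the class name, the rankTopics helper, and the sample weights are all made up for illustration and are not part of the MLlib API.

```java
import java.util.Arrays;
import java.util.Comparator;

class TopicRanking {
    /** Returns topic indices ordered from the highest to the lowest weight. */
    static int[] rankTopics(double[] topicDistribution) {
        Integer[] order = new Integer[topicDistribution.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort the indices by descending weight (negating the weight reverses the order).
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -topicDistribution[i]));

        int[] ranked = new int[order.length];
        for (int i = 0; i < ranked.length; i++) {
            ranked[i] = order[i];
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical topic weights for one document with k = 3 topics.
        double[] doc = {0.2, 0.7, 0.1};
        System.out.println(Arrays.toString(rankTopics(doc))); // prints [1, 0, 2]
    }
}
```

In a real job the double[] would come from Vector.toArray() on each value of the pair RDD.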

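Regarding the second question:

The LDA model returns, for each topic, a probability distribution over every word in the vocabulary. So you have three topics (three columns), each with 11 rows (one per word in the vocabulary), because the vocabulary size is 11. In the sample data each line is one document and each of its 11 columns holds the count of one vocabulary term, so the values 0-9 are counts, not word IDs.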
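
To make the vocabulary size concrete, here is a small sketch in plain Java (no Spark dependency) that parses the first line of the sample data from the question into a count vector. The class and method names are hypothetical; in Spark the same array would typically be wrapped with Vectors.dense.

```java
import java.util.Arrays;

class CountVectorDemo {
    /** Parses one whitespace-separated line of term counts into a vector. */
    static double[] parseDocument(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    public static void main(String[] args) {
        // First line of the sample data shown in the question.
        double[] doc = parseDocument("1 2 6 0 2 3 1 1 0 0 3");

        // One entry per vocabulary term, so the length is the vocabulary size (11 here).
        System.out.println("vocabulary size = " + doc.length);

        // The entries are counts; their sum is the number of tokens in the document (19 here).
        System.out.println("tokens in document = " + Arrays.stream(doc).sum());
    }
}
```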

Answer 2

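Why does each column not sum to 100 (which, as I understand it, would mean 100%)?

Use the describeTopics method to get each topic's distribution over words (the vocabulary).
The word probabilities of a topic should sum to about 1.0 (close to it, though not necessarily exactly 1.0 because of floating-point rounding).

Sample code in Java: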

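    // Requires: import scala.Tuple2;
    // describeTopics() returns, for each topic, the term indices and their weights.
    Tuple2<int[], double[]>[] topicDesces = ldaModel.describeTopics();
    int topicCount = topicDesces.length;

    for (int t = 0; t < topicCount; t++) {
        Tuple2<int[], double[]> topic = topicDesces[t];
        System.out.print("Topic " + t + ":");

        int[] indices = topic._1();   // term (vocabulary) indices
        double[] values = topic._2(); // weight of each term in this topic
        double sum = 0.0d;
        int wordCount = indices.length;

        for (int w = 0; w < wordCount; w++) {
            double prob = values[w];
            System.out.format("\t%d:%f", indices[w], prob);
            sum += prob;
        }
        // The sum over all terms of a topic should be close to 1.0.
        System.out.println("(" + sum + ")");
    }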
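
The matrix printed in the question looks like unnormalized term weights rather than probabilities, which would explain why its columns do not sum to 1. Under that assumption, dividing each column by its own total gives per-topic probabilities. The sketch below shows this in plain Java (no Spark dependency); normalizeColumns is a hypothetical helper and the small matrix is made-up data, not the output from the question.

```java
class TopicNormalizer {
    /** Divides every column by its sum so that each column adds up to 1. */
    static double[][] normalizeColumns(double[][] weights) {
        int rows = weights.length;
        int cols = weights[0].length;
        double[][] probs = new double[rows][cols];

        for (int c = 0; c < cols; c++) {
            // Total weight of topic c over all vocabulary terms.
            double sum = 0.0;
            for (int r = 0; r < rows; r++) {
                sum += weights[r][c];
            }
            for (int r = 0; r < rows; r++) {
                probs[r][c] = weights[r][c] / sum;
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        // Made-up weights: 3 vocabulary terms (rows) by 2 topics (columns).
        double[][] weights = {
            {2.0, 1.0},
            {6.0, 3.0},
            {2.0, 4.0}
        };
        double[][] probs = normalizeColumns(weights);

        for (int c = 0; c < probs[0].length; c++) {
            double sum = 0.0;
            for (double[] row : probs) {
                sum += row[c];
            }
            System.out.println("Topic " + c + " column sum after normalizing: " + sum);
        }
    }
}
```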
