How to predict topics for a batch of documents with mallet

Question

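I am using mallet from a Scala project. After training the topic model and getting the inferencer file, I tried to assign topics to new texts. The problem is that I get different results depending on how I call the inferencer. Here is what I tried:

Creating a new InstanceList, ingesting just one document, and getting the topic result from that InstanceList: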

somecontentList.map(text => getTopics(text, model))

def getTopics(text: String, inferencer: TopicInferencer): Array[Double] = {
    // Build a one-document InstanceList and infer topics for that single instance.
    val testing = new InstanceList(pipe)
    testing.addThruPipe(new Instance(text, null, "test instance", null))
    inferencer.getSampledDistribution(testing.get(0), iter, 1, burnIn)
}

Putting everything into one InstanceList and predicting the topics together:

val testing = new InstanceList(pipe)
somecontentList.foreach(text =>
    testing.addThruPipe(new Instance(text, null, "test instance", null))
)
(0 until testing.size).map(i =>
    ldaModel.getSampledDistribution(testing.get(i), 100, 1, 50))

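These two methods produce very different results, except for the first instance. What is the right way to use the inferencer?

Additional information: I inspected the instance data.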

0: topic (0)
1: beaten (1)
2: death (2)
3: examples (3)
4: forum (4)
5: wanted (5)
6: contributing (6)

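I assume the number in parentheses is the word index used in prediction. When I put all the texts into one InstanceList, the indices are different because the collection contains more text. I am not sure how exactly this information is taken into account during the model's prediction.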
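The index shift described above can be reproduced without mallet at all. The following is a minimal sketch, assuming an alphabet that hands out the next free id the first time it sees a word; `ToyAlphabet`, `AlphabetDemo` and the two sample texts are hypothetical names made up for this illustration and are not part of mallet.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an alphabet that grows on first sight of a word.
final class ToyAlphabet {
  private val ids = mutable.LinkedHashMap.empty[String, Int]

  // Return the id of a word, assigning the next free id if the word is new.
  def lookupIndex(word: String): Int = ids.getOrElseUpdate(word, ids.size)

  def size: Int = ids.size
}

object AlphabetDemo {
  // Lowercase, split on whitespace, and map every token to its id.
  def encode(alphabet: ToyAlphabet, text: String): Vector[Int] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).map(alphabet.lookupIndex).toVector

  def main(args: Array[String]): Unit = {
    val docA = "topic beaten death"
    val docB = "forum wanted topic"

    // A fresh alphabet per document: ids restart at 0 every time.
    val aloneB = encode(new ToyAlphabet, docB)

    // One fresh alphabet shared by the batch: ids depend on earlier documents.
    val shared = new ToyAlphabet
    val batchA = encode(shared, docA)
    val batchB = encode(shared, docB)

    println(s"docA in batch: $batchA") // Vector(0, 1, 2)
    println(s"docB alone:    $aloneB") // Vector(0, 1, 2)
    println(s"docB in batch: $batchB") // Vector(3, 4, 0)
  }
}
```

In this toy setup the first document gets the same ids either way, which would be consistent with only the first instance agreeing between the two methods. Neither numbering is guaranteed to match the ids the model was trained with; only the alphabet from training does.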

Answer 1

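I ran into a similar issue, although using the R plugin. We ended up calling the Inferencer separately for each row/document.

Even so, there will be some differences in the inference when calling it on the same row, because of the stochastic nature of the sampling and the inference. I agree, though, that those differences should be small.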
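To separate sampling noise from a genuine mismatch, one option is to fix the inferencer's random seed and compare two runs on the same instance. This is only a sketch: the `RepeatableInference` object, the `inferencerFile` parameter and the `maxAbsDiff` helper are hypothetical names for illustration, the iteration settings are the ones from the question, and the instance is assumed to have already gone through the training pipe.

```scala
import java.io.File
import cc.mallet.topics.TopicInferencer
import cc.mallet.types.Instance

object RepeatableInference {
  // `instance` is assumed to have been imported through the training pipe.
  def infer(inferencerFile: File, instance: Instance, seed: Int): Array[Double] = {
    val inferencer = TopicInferencer.read(inferencerFile)
    inferencer.setRandomSeed(seed) // fix the random draws of the sampler
    // 100 iterations, thinning 1, burn-in 50, as in the question.
    inferencer.getSampledDistribution(instance, 100, 1, 50)
  }

  // Largest per-topic gap between two inferred distributions
  // (both arrays are assumed to be non-empty and of equal length).
  def maxAbsDiff(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max
}
```

With the same seed, two runs should give the same distribution; with different seeds, `maxAbsDiff` gives a rough idea of how much variation comes from sampling alone. A much larger gap between the two calling styles would point to something other than randomness.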

Answer 2

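Remember that new instances must be imported using the pipe from the original data, as recorded in the Inferencer, so that the alphabets match. It is not clear where pipe comes from in the Scala code, but the fact that the first few words have ids counting up from 0 suggests that this is a new alphabet.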
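One way to make the alphabets match is to take the pipe from the serialized training instances rather than building a new one. The sketch below assumes the training InstanceList and the inferencer were both saved to disk when the model was built; the file names `training.mallet` and `inferencer.mallet` and the `BatchInference` object are placeholders for illustration, not names taken from the question.

```scala
import java.io.File
import cc.mallet.topics.TopicInferencer
import cc.mallet.types.{Instance, InstanceList}

object BatchInference {
  // Assumed file names; substitute the files written when the model was trained.
  val trainingFile   = new File("training.mallet")
  val inferencerFile = new File("inferencer.mallet")

  def inferAll(texts: Seq[String]): Seq[Array[Double]] = {
    // Reuse the pipe, and with it the alphabet, stored with the training instances.
    val training = InstanceList.load(trainingFile)
    val testing  = new InstanceList(training.getPipe)

    // Every new text goes through the same pipe, so word ids match the model.
    texts.zipWithIndex.foreach { case (text, i) =>
      testing.addThruPipe(new Instance(text, null, s"test-$i", null))
    }

    val inferencer = TopicInferencer.read(inferencerFile)
    (0 until testing.size).map { i =>
      // 100 iterations, thinning 1, burn-in 50, as in the question.
      inferencer.getSampledDistribution(testing.get(i), 100, 1, 50)
    }
  }
}
```

With a shared training pipe, a word's id no longer depends on which other documents are in the batch, so calling the inferencer per document or per batch should differ only by sampling noise.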
