Document relevancy score based on topic modelling

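I currently have a topic model trained with MALLET (http://mallet.cs.umass.edu/topics.php) on about 80,000 collected news articles, all of which belong to one category.

I would like to give each new article a relevancy score as it comes in (it may or may not be related to the category). Is there any way to do this? I have read about tf-idf, but it seems to score existing articles rather than any new ones. The end goal is to filter out articles that are likely irrelevant.

Any ideas or help are much appreciated. Thank you!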

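Once you have the model (topics), you can score new, unseen documents against it. In MALLET, the --evaluator-filename [FILENAME] option used when training saves a held-out likelihood evaluator, which bin/mallet evaluate-topics can then apply to the new documents. From the MALLET documentation: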

Topic Held-out probability

--evaluator-filename [FILENAME] The previous section describes how to get topic proportions for new documents. We often want to estimate the log probability of new documents, marginalized over all topic configurations. Use the MALLET command bin/mallet evaluate-topics --help to get information on using held-out probability estimation. As with topic inference, you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
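One way to turn held-out log probabilities into a relevancy filter is sketched below. It is only an illustration under assumptions: the function names, the sample numbers, and the cut-off of -8.0 are all hypothetical, and dividing by the token count is my own suggestion (longer documents naturally get lower total log probabilities), not something the MALLET documentation prescribes.

```python
def per_token_log_likelihood(doc_log_prob, num_tokens):
    """Length-normalise a document's held-out log probability."""
    if num_tokens <= 0:
        raise ValueError("document has no tokens in the model vocabulary")
    return doc_log_prob / num_tokens


def filter_relevant(docs, min_per_token_ll):
    """Keep ids of documents whose per-token log-likelihood reaches the cut-off.

    docs: iterable of (doc_id, log_prob, num_tokens) tuples.
    """
    kept = []
    for doc_id, log_prob, num_tokens in docs:
        if per_token_log_likelihood(log_prob, num_tokens) >= min_per_token_ll:
            kept.append(doc_id)
    return kept


# Made-up values for illustration only, not real MALLET output.
docs = [("a", -700.0, 100), ("b", -1800.0, 200), ("c", -350.0, 50)]
print(filter_relevant(docs, min_per_token_ll=-8.0))
```

A sensible cut-off would probably have to be calibrated on held-out articles that are known to belong to the category.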

Note: I mostly use gensim's LDA and LSI models, where you can pass in a new document as follows:

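# Assumes `dictionary` (a gensim.corpora.Dictionary) and `lda_model`
# (a gensim.models.LdaModel) were already built from the training corpus.
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())  # bag-of-words vector
print(lda_model[new_vec])  # list of (topic_id, weight) pairs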

#output: [(0, 0.020229542), (1, 0.49642297)

Interpretation: (1, 0.49642297) means that, of the 2 topics (categories) in the model, the new document is best represented by topic #1. So in your case you can take the maximum weight from the output list as the relevancy "coefficient": a high coefficient means the document is in the category, a low one means it is not. (I used 2 topics for easier visualization. If you have only one topic, just add a simple threshold for the minimum you want to accept, for example 0.40: if the weight falls above it, the document is in the category, otherwise not.)
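The thresholding step could be sketched like this. It is a minimal example, assuming the input is a list of (topic_id, weight) pairs like the one printed above; the helper names relevancy_score and is_relevant are hypothetical, and 0.40 is only the example threshold mentioned here, not a recommended value.

```python
def relevancy_score(topic_weights):
    """Return the (topic_id, weight) pair with the highest weight.

    topic_weights: list of (topic_id, weight) pairs, as in the output above.
    An empty list yields (None, 0.0).
    """
    if not topic_weights:
        return None, 0.0
    return max(topic_weights, key=lambda pair: pair[1])


def is_relevant(topic_weights, threshold=0.40):
    """True if the strongest topic weight reaches the threshold."""
    _, weight = relevancy_score(topic_weights)
    return weight >= threshold


example = [(0, 0.020229542), (1, 0.49642297)]
print(relevancy_score(example))
print(is_relevant(example))
```

In practice the threshold would need tuning on a handful of articles you have labelled as relevant or irrelevant.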