Use topic modeling information from LDA as features to perform text classification through SVM

Question

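I want to perform text classification using topic modeling information as the features fed to an SVM classifier. Since the corpus differs between the training and testing partitions of the dataset, how can I generate topic modeling features by running LDA on both partitions?

Am I making a wrong assumption?

Could you provide an example of how to do this with scikit-learn?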

Answer 1

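Your assumption is right. You train LDA on the training data only, then use that trained model to transform both the training and the testing data.

So you will have something like this: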

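from sklearn.decomposition import LatentDirichletAllocation as LDA

# training_data and testing_data are document-term count matrices
# (e.g. from CountVectorizer), not raw strings.
# n_topics was renamed to n_components in newer scikit-learn releases.
lda = LDA(n_components=10)  # other hyperparameters omitted
lda.fit(training_data)  # learn the topics from the training split only
training_features = lda.transform(training_data)
testing_features = lda.transform(testing_data)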
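
To make the flow concrete end to end, here is a minimal sketch under a few assumptions of mine: the toy documents and labels are invented for illustration, CountVectorizer produces the count matrix, and LinearSVC stands in for the SVM (any scikit-learn SVM classifier could be swapped in). The vocabulary and the topics are both learned from the training documents only, and the test documents are merely transformed.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented purely for illustration.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "investors sold shares and bonds",
]
train_labels = [0, 0, 1, 1]
test_docs = ["my cat chased the dog", "the market rallied as shares rose"]

# The vocabulary is fixed by the training split; test docs reuse it.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_docs)
X_test_counts = vectorizer.transform(test_docs)

# Topics are fitted on the training counts only, then applied to both splits.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
train_topics = lda.fit_transform(X_train_counts)
test_topics = lda.transform(X_test_counts)

# Train the SVM on the per-document topic distributions.
clf = LinearSVC()
clf.fit(train_topics, train_labels)
predictions = clf.predict(test_topics)
print(predictions)
```

With a corpus this small the predictions mean nothing; the point is only the order of the fit and transform calls.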

If I were you, I would concatenate the LDA features with the bag-of-words features using numpy.hstack, or scipy.sparse.hstack if your bag-of-words features are sparse.
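
A rough sketch of that concatenation, assuming the bag-of-words matrix comes from CountVectorizer (which returns a sparse matrix) and using a made-up three-document corpus. The dense LDA output is wrapped in a sparse matrix so that scipy.sparse.hstack keeps the combined result sparse.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration.
docs = [
    "the cat sat on the mat",
    "stocks fell on the market today",
    "dogs and cats are pets",
]

# Sparse bag-of-words counts, shape (n_docs, vocabulary_size).
bow = CountVectorizer().fit_transform(docs)

# Dense topic proportions, shape (n_docs, n_components).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(bow)

# Stack column-wise: each row is [bag-of-words counts | topic proportions].
combined = hstack([bow, csr_matrix(topics)]).tocsr()
print(combined.shape)
```

For a train/test split, the same stacking would be applied to each split separately, after transforming it with the vectorizer and LDA model fitted on the training data.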

python

classification

svm

lda