Sample data with too many dimensions in SVM

Question

I am working with training and test data made up of Google search snippets.

The training data contains 10,060 snippets, one per line. Each snippet consists of a list of words/terms followed by a class label at the end.

There are 8 class labels:

Business,Computers,Culture-Arts,Entertainment,Education-Science,Engineering,Health,Politics-Society,Sports

Here are a few lines from the dataset:

manufacture manufacturer directory directory china taiwan products manufacturers directory- taiwan china products manufacturer direcory exporter directory supplier directory suppliers business

empmag electronics manufacturing procurement homepage electronics manufacturing procurement magazine procrement power products production essentials data management business

dfma truecost paper true cost overseas manufacture product design costs manufacturing products china manufacturing redesigned product china save business

As you can see, the snippets differ in length, but the data must have the same number of dimensions to be used with an SVM.

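I am considering using 1 to indicate that a word occurs in a given line and 0 otherwise, so that every line becomes a 0/1 vector. However, that would produce too many dimensions.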
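A minimal sketch of the 0/1 encoding I have in mind, using shortened versions of two of the sample lines above. The helper names (`parse`, `binary_vector`) are only illustrative, and the label is assumed to be the last token on each line:

```python
# Shortened sample lines; the last token on each line is the class label.
lines = [
    "manufacture manufacturer directory china taiwan products business",
    "electronics manufacturing procurement magazine power products business",
]

def parse(line):
    """Split a snippet line into (terms, label)."""
    *terms, label = line.split()
    return terms, label

docs = [parse(line) for line in lines]

# One dimension per distinct term seen in the training data.
vocab = sorted({t for terms, _ in docs for t in terms})
index = {t: i for i, t in enumerate(vocab)}

def binary_vector(terms):
    """Dense 0/1 vector: 1 if the term occurs in the snippet, 0 otherwise."""
    vec = [0] * len(vocab)
    for t in terms:
        if t in index:  # terms unseen in training are ignored
            vec[index[t]] = 1
    return vec

vectors = [binary_vector(terms) for terms, _ in docs]
```

With the full training set the vocabulary, and therefore each vector, would have one entry per distinct word, which is the dimensionality problem described above.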

My question: is there any other way to preprocess the data so that an SVM can be run on it efficiently?

Answer 1

Before performing text classification with an SVM, you should look into term weighting and feature selection.

The default approach is:

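1. Look into tfc term weighting. It is based on the so-called inverse document frequency multiplied by the term frequency (in the current document).
2. Look into feature selection based on Information Gain.
3. Transform your documents based on 1. and 2.
4. Perform text classification with the SVM.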
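
Steps 1 to 3 can be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not a reference implementation: the toy documents are invented, all function names are hypothetical, `k = 5` is arbitrary, tfc is taken to mean tf × idf with cosine normalisation, and Information Gain is computed in its entropy form, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). The output uses the sparse `label index:value` line format read by LIBSVM.

```python
import math
from collections import Counter

# Toy labelled snippets (terms, class label), invented for illustration.
docs = [
    (["china", "manufacturer", "products", "products"], "business"),
    (["manufacturing", "products", "cost"], "business"),
    (["football", "league", "results"], "sports"),
    (["football", "cost", "tickets"], "sports"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs):
    """IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)."""
    all_labels = [y for _, y in docs]
    with_t = [y for terms, y in docs if term in terms]
    without_t = [y for terms, y in docs if term not in terms]
    n = len(docs)
    return (entropy(all_labels)
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))

def tfc_vector(terms, doc_freq, n_docs, index):
    """tf * idf weights, cosine-normalised, as a sparse {index: weight} dict."""
    tf = Counter(t for t in terms if t in index)
    raw = {index[t]: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {i: w / norm for i, w in raw.items()} if norm > 0 else {}

def to_libsvm_line(label_id, vec):
    """Format a sample as 'label index:value ...' (1-based, ascending indices)."""
    feats = " ".join(f"{i + 1}:{w:.4f}" for i, w in sorted(vec.items()))
    return f"{label_id} {feats}".strip()

# Step 2: rank terms by information gain and keep the k best.
vocab = sorted({t for terms, _ in docs for t in terms})
k = 5
selected = sorted(vocab, key=lambda t: information_gain(t, docs), reverse=True)[:k]
index = {t: i for i, t in enumerate(sorted(selected))}

# Steps 1 and 3: weight the selected terms and re-encode every document.
doc_freq = Counter(t for terms, _ in docs for t in set(terms))
label_ids = {y: i for i, y in enumerate(sorted({y for _, y in docs}))}
vectors = [tfc_vector(terms, doc_freq, len(docs), index) for terms, _ in docs]
libsvm_lines = [to_libsvm_line(label_ids[y], v) for (_, y), v in zip(docs, vectors)]
```

Because only non-zero entries are stored, the cost per document depends on the number of terms in the snippet rather than on the vocabulary size. The resulting lines can be written to a file and passed to an SVM trainer that accepts this sparse format (step 4).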

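I recommend the following publications for further reading. In them you will find the typical approaches used in the research community for SVM-based text classification: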

Joachims T. (1998) Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec C., Rouveirol C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol 1398. Springer, Berlin, Heidelberg
Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning (ICML), 1997.
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
