How to reduce the number of features in text classification?

python
nlp
text-classification
naivebayes
countvectorizer

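I am working on dialect text classification, using CountVectorizer with naive Bayes. The number of features is too large: I collected 20k tweets covering 4 dialects, 5,000 tweets per dialect, and the total number of features is 43K. I suspect this is why I am overfitting, because accuracy drops sharply when I test on new data. How can I choose the number of features so that I avoid overfitting the data?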

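You can set the max_features parameter, for example to 5000, which may help prevent overfitting. You can also adjust max_df (for example, set it to 0.95, so that terms appearing in more than 95% of the tweets are ignored).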
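A minimal sketch of the two settings above, assuming scikit-learn is installed. The toy tweets, the labels, and the small max_features=10 are hypothetical stand-ins so the example runs on four sentences; on the real corpus you would try something like max_features=5000 and pick the value by cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the 20k tweets and their dialect labels.
tweets = [
    "the weather is nice today",
    "the match was great tonight",
    "the food here tastes amazing",
    "the traffic is terrible again",
]
labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

# max_features keeps only the most frequent terms (try e.g. 5000 on the real corpus);
# max_df=0.95 drops terms that occur in more than 95% of the documents.
vectorizer = CountVectorizer(max_features=10, max_df=0.95)
model = make_pipeline(vectorizer, MultinomialNB())
model.fit(tweets, labels)

predictions = model.predict(tweets)
print(len(vectorizer.vocabulary_), "features kept")
```

Here "the" occurs in every toy tweet, so max_df removes it before max_features caps the vocabulary.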

The drop in accuracy on the test data is caused by the curse of dimensionality. You can use a dimensionality reduction method to reduce this effect. One possible choice is Latent Semantic Analysis, which is implemented in sklearn.
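A rough sketch of the LSA suggestion, assuming scikit-learn, where LSA on a term-count matrix is available as TruncatedSVD. One caveat: the SVD output contains negative values, which MultinomialNB rejects, so this sketch swaps in LogisticRegression as the classifier. That swap, the toy data, and n_components=2 are assumptions for illustration; on 43K features a value in the low hundreds would be a more typical starting point.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the tweets and their dialect labels.
tweets = [
    "the weather is nice today",
    "the match was great tonight",
    "the food here tastes amazing",
    "the traffic is terrible again",
]
labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

# TruncatedSVD projects the sparse count matrix onto a few latent dimensions (LSA).
svd = TruncatedSVD(n_components=2, random_state=0)

# LogisticRegression replaces naive Bayes here because the projected
# features can be negative, which MultinomialNB does not accept.
lsa_model = make_pipeline(CountVectorizer(), svd, LogisticRegression(max_iter=1000))
lsa_model.fit(tweets, labels)

predictions = lsa_model.predict(tweets)
print(svd.components_.shape)  # (n_components, vocabulary size)
```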
