Huge number of classes with Multinomial Naive Bayes (scikit-learn)

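Whenever I get to a larger number of classes (1000 or more), MultinomialNB becomes extremely slow and takes gigabytes of RAM. The same is true for all the scikit-learn classification algorithms that support .partial_fit() (SGDClassifier, Perceptron). With convolutional neural networks, 10000 classes are no problem. But when I try to train MultinomialNB on the same data, my 12 GB of RAM are not enough and training is very, very slow. From my understanding of Naive Bayes, it should be much faster even with a lot of classes. Could this be an issue with the scikit-learn implementation (maybe with the .partial_fit() function)? How can I train MultinomialNB/SGDClassifier/Perceptron on 10000+ classes (in batches)?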

Short answer, without much detail:

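MultinomialNB fits a separate model for each class, so with C = 10000+ classes it fits C = 10000+ models. The model parameters alone form an [n_classes x n_features] array, which takes a lot of memory when n_features is large.
scikit-learn's SGDClassifier uses a one-versus-all (OVA) strategy to train multiclass models (SGDClassifier is not inherently multiclass), so that is another C = 10000+ models to train.
As for Perceptron, from the scikit-learn documentation: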

Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).
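To make the [n_classes x n_features] point concrete, here is a minimal sketch. It fits both estimators on a tiny synthetic dataset and inspects the shape of the fitted parameter arrays. The toy sizes, the helper `dense_param_bytes`, and the 10,000 x 100,000 figure are assumptions made up for illustration; they are not taken from the question's data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 200, 30, 20  # toy sizes, illustration only

X = rng.integers(0, 5, size=(n_samples, n_features))  # non-negative counts
y = np.arange(n_samples) % n_classes                   # every class appears

nb = MultinomialNB().fit(X, y)
sgd = SGDClassifier(max_iter=5, tol=None, random_state=0).fit(X, y)

# Both estimators keep one dense row of parameters per class.
nb_shape = nb.feature_log_prob_.shape   # (n_classes, n_features)
sgd_shape = sgd.coef_.shape             # (n_classes, n_features)


def dense_param_bytes(n_classes, n_features, itemsize=8):
    """Hypothetical helper: bytes for one dense float64 parameter array."""
    return n_classes * n_features * itemsize


# Assumed example sizes: 10,000 classes and 100,000 features.
estimate_gb = dense_param_bytes(10_000, 100_000) / 1e9
print(nb_shape, sgd_shape, f"{estimate_gb:.1f} GB")
```

Under those assumed sizes, a single float64 parameter array already needs about 8 GB, before counting any temporary arrays created during fitting.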

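So none of the three classifiers you mention copes well with a large number of classes, because each one needs to train a separate model per class. I would suggest trying something that supports multiclass classification natively, such as RandomForestClassifier.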
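As a rough sketch of that suggestion, this is how RandomForestClassifier could be used on a small multiclass problem. The synthetic data and the hyperparameters (`n_estimators=10`) are placeholders chosen for illustration, not tuned values, and nothing here has been measured at the 10000-class scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 300, 20, 15  # toy sizes, illustration only

X = rng.integers(0, 5, size=(n_samples, n_features))
y = np.arange(n_samples) % n_classes  # every class appears

# A single forest handles all classes; there is no one-versus-all wrapper.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

n_learned_classes = len(forest.classes_)
pred = forest.predict(X[:5])
proba_shape = forest.predict_proba(X[:5]).shape  # (5, n_classes)
print(n_learned_classes, pred.shape, proba_shape)
```

Note that RandomForestClassifier has no .partial_fit(), so this route assumes the training data (or a subsample of it) fits in memory at once.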
