How to combine the outputs of multiple naive Bayes classifiers?

Question

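I am new to this.

I have a set of weak classifiers built with the Naive Bayes classifier (NBC) from the sklearn toolkit.

My question is how to combine the outputs of the individual NBCs to make a final decision. I want the decision to be based on probabilities rather than labels.

I wrote the following program in Python. I assume a 2-class problem derived from the iris dataset in sklearn. For demo/learning purposes, suppose I build 4 NBCs as follows.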

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

import numpy as np

iris = datasets.load_iris()

gnb1 = GaussianNB()
gnb2 = GaussianNB()
gnb3 = GaussianNB()
gnb4 = GaussianNB()

# The actual dataset has 3 classes; collapse it into 2 classes for this demo
# (class 0 becomes label 1, classes 1 and 2 become label 2)
target = np.where(iris.target, 2, 1)

gnb1.fit(iris.data[:, 0].reshape(150,1), target)
gnb2.fit(iris.data[:, 1].reshape(150,1), target)
gnb3.fit(iris.data[:, 2].reshape(150,1), target)
gnb4.fit(iris.data[:, 3].reshape(150,1), target)

#y_pred = gnb.predict(iris.data)
index = 0
y_prob1 = gnb1.predict_proba(iris.data[index,0].reshape(1,1))
y_prob2 = gnb2.predict_proba(iris.data[index,1].reshape(1,1))
y_prob3 = gnb3.predict_proba(iris.data[index,2].reshape(1,1))
y_prob4 = gnb4.predict_proba(iris.data[index,3].reshape(1,1))

# print(y_prob1, "\n", y_prob2, "\n", y_prob3, "\n", y_prob4)

# Sum the per-classifier probabilities for each class
pos = y_prob1[:,1] + y_prob2[:,1] + y_prob3[:,1] + y_prob4[:,1]
neg = y_prob1[:,0] + y_prob2[:,0] + y_prob3[:,0] + y_prob4[:,0]

print(pos)
print(neg)

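As you can see, I simply add up the probabilities from each NBC to get the final score. I would like to know whether this is correct.

If I have done it wrong, could you suggest some ideas so that I can correct it myself?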

Answer 1

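First of all, why are you doing this? You should have one Naive Bayes here, not one per feature. It looks like you do not understand the idea of the classifier. What you have done is actually what Naive Bayes does internally: it treats each feature independently. But since these are probabilities, you should multiply them, or add their logarithms. So:

1. You need just one NB: gnb.fit(iris.data, target)
2. If you insist on having many NBs, you should merge them through multiplication or through addition of logarithms (the two are the same from a mathematical point of view, but multiplication is less stable numerically)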

pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1]

or

pos = np.exp(np.log(y_prob1[:,1]) + np.log(y_prob2[:,1]) + np.log(y_prob3[:,1]) + np.log(y_prob4[:,1]))

You can also predict the logarithms directly with gnb.predict_log_proba instead of gnb.predict_proba.
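As a minimal sketch of the two equivalent forms above, the following fits one GaussianNB per iris feature (mirroring the question's setup) and checks that the product of the probabilities matches the exponentiated sum of the log-probabilities. The names `models`, `prod_pos`, and `log_pos` are introduced here for illustration only, and the prior is not corrected for at this stage.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
# Collapse the 3 iris classes into 2, as in the question (labels 1 and 2).
target = np.where(iris.target, 2, 1)

# One single-feature classifier per column (illustrative helper list).
models = [GaussianNB().fit(iris.data[:, [j]], target) for j in range(4)]

# Product of the per-model probabilities for the second class (label 2).
prod_pos = np.prod(
    [m.predict_proba(iris.data[:, [j]])[:, 1] for j, m in enumerate(models)],
    axis=0,
)

# The same quantity accumulated in log space.
log_pos = np.sum(
    [m.predict_log_proba(iris.data[:, [j]])[:, 1] for j, m in enumerate(models)],
    axis=0,
)

# Both routes should agree up to floating-point error.
print(np.allclose(prod_pos, np.exp(log_pos)))
```

Working with `predict_log_proba` avoids taking `np.log` of probabilities that may already have underflowed to zero.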

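However, this approach has one flaw: Naive Bayes also includes the prior in each of your probabilities, so multiplying them counts the prior several times and gives very skewed distributions. So you have to normalize manually:

pos_prior_ = gnb1.class_prior_[1] # all models have the same prior, so we can use gnb1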

pos = pos_prior_ * (y_prob1[:,1]/pos_prior_) * (y_prob2[:,1]/pos_prior_) * (y_prob3[:,1]/pos_prior_) * (y_prob4[:,1]/pos_prior_)

which simplifies to

pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1] / pos_prior_**3

and in log space to

pos = ... - 3 * np.log(pos_prior_)

So once again, you should use option 1.
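To make the log-space expression concrete, here is a hedged sketch that applies the prior correction to both classes at once, using four single-feature models as in the question. `n_models`, `log_sum`, and `log_scores` are names introduced for illustration; the result is an unnormalized score, not yet a probability.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
target = np.where(iris.target, 2, 1)  # 2-class version of iris, as in the question

n_models = 4
models = [GaussianNB().fit(iris.data[:, [j]], target) for j in range(n_models)]

# Sum of per-model log-probabilities, shape (n_samples, n_classes).
log_sum = sum(m.predict_log_proba(iris.data[:, [j]]) for j, m in enumerate(models))

# Each model already contains the prior once, so remove the
# (n_models - 1) extra copies.
log_prior = np.log(models[0].class_prior_)
log_scores = log_sum - (n_models - 1) * log_prior

# Direct (non-log) form from the answer, for the second class only.
probs = [m.predict_proba(iris.data[:, [j]])[:, 1] for j, m in enumerate(models)]
pos = np.prod(probs, axis=0) / models[0].class_prior_[1] ** (n_models - 1)

# The log-space and direct forms should agree up to floating-point error.
print(np.allclose(np.exp(log_scores[:, 1]), pos))
```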

Answer 2

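The previous answer is almost correct. The one thing missing is that you need to divide his pos result (the product of the probabilities, divided by the priors) by the sum of the pos results over all classes. Otherwise, the probabilities of all classes will not sum to 1.

Here is sample code that tests the result of this procedure on a dataset with 6 features: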
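The reasoning can be written out as a short derivation. This is a sketch under the usual naive Bayes assumption that the feature groups are conditionally independent given the class; the symbols $K$ (number of classifiers), $x^{(k)}$ (the features seen by classifier $k$), and $P_k$ (its predicted probability) are notation introduced here, not taken from the answers.

```latex
% Each classifier k reports
P_k\bigl(c \mid x^{(k)}\bigr) = \frac{P(c)\, P\bigl(x^{(k)} \mid c\bigr)}{P\bigl(x^{(k)}\bigr)}

% Under conditional independence, the joint posterior satisfies
P(c \mid x) \;\propto\; P(c) \prod_{k=1}^{K} P\bigl(x^{(k)} \mid c\bigr)
          \;\propto\; P(c)^{\,1-K} \prod_{k=1}^{K} P_k\bigl(c \mid x^{(k)}\bigr)

% The dropped factors do not depend on c, so normalizing over classes gives
P(c \mid x) = \frac{P(c)^{\,1-K} \prod_{k} P_k\bigl(c \mid x^{(k)}\bigr)}
                   {\sum_{c'} P(c')^{\,1-K} \prod_{k} P_k\bigl(c' \mid x^{(k)}\bigr)}
```

With $K = 4$ this reproduces the division by the prior cubed from the first answer, and the denominator is the sum over classes described here.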

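import numpy as np
from sklearn.naive_bayes import GaussianNB

# X is assumed to be an (n_samples, 6) feature array and y the class labels.

# Use one Naive Bayes for all 6 features: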

gaus = GaussianNB(var_smoothing=0)
gaus.fit(X, y)
y_prob1 = gaus.predict_proba(X)

# Use one Naive Bayes on each half of the features and multiply the results:

gaus1 = GaussianNB(var_smoothing=0)
gaus1.fit(X[:, :3], y)
y_log_prob1 = gaus1.predict_log_proba(X[:, :3])

gaus2 = GaussianNB(var_smoothing=0)
gaus2.fit(X[:, 3:], y)
y_log_prob2 = gaus2.predict_log_proba(X[:, 3:])

pos = np.exp(y_log_prob1 + y_log_prob2 - np.log(gaus1.class_prior_))
y_prob2 = pos / pos.sum(axis=1)[:,None]

y_prob1 should equal y_prob2 up to numerical error (var_smoothing=0 helps reduce the error).
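The 6-feature dataset used above is not given, so as a runnable stand-in the same check can be sketched on the 4-feature iris dataset, split into two halves of 2 features each. Using iris here is an assumption made for illustration, and the row normalization is done with `scipy.special.logsumexp` in log space rather than by dividing after `np.exp`, which is a substitution for numerical stability and not what the answer's code does.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

X, y = datasets.load_iris(return_X_y=True)  # 4 features, 3 classes

# Reference: one model on all features.
full = GaussianNB(var_smoothing=0).fit(X, y)
p_full = full.predict_proba(X)

# Two models on disjoint feature halves.
left = GaussianNB(var_smoothing=0).fit(X[:, :2], y)
right = GaussianNB(var_smoothing=0).fit(X[:, 2:], y)

# Combine in log space: add the log-probabilities, remove the one extra
# copy of the prior, then normalize each row so the classes sum to 1.
log_scores = (
    left.predict_log_proba(X[:, :2])
    + right.predict_log_proba(X[:, 2:])
    - np.log(left.class_prior_)
)
p_combined = np.exp(log_scores - logsumexp(log_scores, axis=1, keepdims=True))

# The combined result should match the single model up to rounding.
print(np.allclose(p_full, p_combined))
```

With the default `var_smoothing`, each model adds a small amount to its variances based on the largest feature variance it sees, so the sub-models and the full model would be smoothed slightly differently; setting it to 0 removes that source of mismatch.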
