How are the probabilities normalized in the one-vs-rest scheme of sklearn Logistic Regression?
In the sklearn LogisticRegression classifier, we can set the multi_class option to 'ovr', which stands for one-vs-rest, as in the following snippet:
# logistic regression for multi-class classification using built-in one-vs-rest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
Now this classifier can assign probabilities to the different classes for a given instance:
# make predictions
yhat = model.predict_proba(X)
The probabilities for each instance sum to 1:
array([[0.16973178, 0.46755188, 0.36271634],
[0.58228627, 0.0928127 , 0.32490103],
[0.28241256, 0.51175978, 0.20582766],
...,
[0.17922774, 0.71300755, 0.10776471],
[0.05888508, 0.24924809, 0.69186683],
[0.25808835, 0.68599321, 0.05591844]])
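A quick sanity check (not in the original snippet) confirms that every row of yhat sums to 1:

import numpy as np
# each row of yhat should sum to 1
print(np.allclose(yhat.sum(axis=1), 1.0))  # True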
My question: in the one-vs-rest approach, one classifier is trained for each class, so we would expect each class's probability to be independent of the other classes. How are the probabilities normalized so that they sum to 1?
The probabilities are normalized by dividing by the row sum (i.e. the sum of the class probabilities for each sample); this is the relevant line from the source code:
prob /= prob.sum(axis=1).reshape((prob.shape[0], -1))
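To see what this line does, here is a minimal sketch with made-up per-class probabilities for two samples:

import numpy as np
# made-up unnormalized per-class probabilities for two samples
prob = np.array([[0.3, 0.6, 0.9],
                 [0.2, 0.1, 0.5]])
# divide each row by its sum, exactly as in the source line above
prob /= prob.sum(axis=1).reshape((prob.shape[0], -1))
print(prob)              # [[0.1667 0.3333 0.5   ]  [0.25   0.125  0.625 ]]
print(prob.sum(axis=1))  # [1. 1.]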
The code below shows how to use this formula to replicate the model's outputs:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# generate some data
X, y = make_classification(n_classes=3, n_features=10, n_informative=5, n_redundant=5, n_samples=1000, random_state=1)
# fit the model
model = LogisticRegression(multi_class='ovr')
model.fit(X, y)
prob_pred = model.predict_proba(X)
print(prob_pred)
# [[0.16973178 0.46755188 0.36271634]
# [0.58228627 0.0928127 0.32490103]
# [0.28241256 0.51175978 0.20582766]
# ...
class_pred = model.predict(X)
print(class_pred)
# [1 0 1 2 0 2 1 2 0 1 1 0 2 1 0 1 2 0 1 0 ...
# replicate the model's outputs
classes = np.unique(y)
n_classes = len(classes)
n_samples = len(y)
prob_pred = np.zeros((n_samples, n_classes))
class_pred = np.zeros(n_samples)
for c in classes:
    # binary target: 1 for the current class, 0 for the rest
    y_ = np.where(y == c, 1, 0)
    model = LogisticRegression()
    model.fit(X, y_)
    prob_pred[:, c] = model.predict_proba(X)[:, 1]
# normalize so that each row sums to 1, as in the source line above
prob_pred /= prob_pred.sum(axis=1).reshape((prob_pred.shape[0], -1))
print(prob_pred)
# [[0.16973178 0.46755188 0.36271634]
# [0.58228627 0.0928127 0.32490103]
# [0.28241256 0.51175978 0.20582766]
# ...
class_pred = classes[np.argmax(prob_pred, axis=1)]
print(class_pred)
# [1 0 1 2 0 2 1 2 0 1 1 0 2 1 0 1 2 0 1 0 ...
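As a quick check (an addition to the answer's code above), refitting the built-in ovr model and comparing against the replicated probabilities should agree up to solver tolerance:

# compare the replicated probabilities with the built-in ovr model
ovr_model = LogisticRegression(multi_class='ovr').fit(X, y)
print(np.allclose(ovr_model.predict_proba(X), prob_pred))  # True (up to solver tolerance)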
As you can see here, the multiclass case is handled by normalizing each class's score for instance x over all classes. The probability that the instance belongs to class $k$ is given by

$$P(y = k \mid x) = \frac{\sigma(f_k(x))}{\sum_{j=1}^{K} \sigma(f_j(x))}, \qquad \sigma(t) = \frac{1}{1 + e^{-t}},$$

where $f_k$ denotes the decision function of the binary classifier for class $k$ and $K$ is the number of classes. This is consistent with the source line quoted in the other answer: each decision value is passed through the logistic sigmoid and the results are divided by their row sum.
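A minimal sketch applying this formula directly to a fitted model (using scipy.special.expit for the sigmoid) reproduces predict_proba:

import numpy as np
from scipy.special import expit  # logistic sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=3, random_state=1)
model = LogisticRegression(multi_class='ovr').fit(X, y)

# f_k(x): one decision value per class, shape (n_samples, 3)
scores = model.decision_function(X)
# sigmoid of each score, then normalize across classes
prob = expit(scores)
prob /= prob.sum(axis=1, keepdims=True)

print(np.allclose(prob, model.predict_proba(X)))  # True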