How can the coefficients of a multiclass logistic regression model be used to predict the probabilities of class membership of observations?
I am trying to solve a problem similar to Fisher's iris classification. The issue is that I can train the model on my own computer, but the model then has to predict class membership on a machine where Python and scikit-learn cannot be installed. I want to understand how, once I have the coefficients of the logistic regression model, I can predict class membership without calling the model's predict methods.
Using the Fisher iris problem as an example, I do the following.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score
# data preparation
iris = load_iris()
data = pd.DataFrame(data=np.hstack([iris.data, iris.target[:, np.newaxis]]),
                    columns=iris.feature_names + ['target'])
names = data.columns
# split data
X_train, X_test, y_train, y_test = train_test_split(data[names[:-1]], data[names[-1]], random_state=42)
# train model
cls = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=2, random_state=42)
)
cls = cls.fit(X_train.to_numpy(), y_train)
preds_train = cls.predict(X_train)
# prediction
preds_test = cls.predict(X_test)
# scores
train_score = accuracy_score(y_train, preds_train), f1_score(y_train, preds_train, average='macro')  # on train data
# train_score = (0.9642857142857143, 0.9653621232568601)
test_score = accuracy_score(y_test, preds_test), f1_score(y_test, preds_test, average='macro')  # on test data
# test_score = (1.0, 1.0)
# model coefficients
cls[1].coef_, cls[1].intercept_
>>> (array([[-1.13948079, 1.30623841, -2.21496793, -2.05617771],
[ 0.66515676, -0.2541143 , -0.55819748, -0.86441227],
[ 0.47432404, -1.05212411, 2.77316541, 2.92058998]]),
array([-0.35860337, 2.43929019, -2.08068682]))
Now I have the model's coefficients, and I want to use them to make predictions.
First, I obtain predictions for the first five observations of the test sample with the predict_proba method.
preds_test = cls.predict_proba(X_test)
preds_test[0:5]
>>>array([[5.66019001e-03, 9.18455687e-01, 7.58841233e-02],
[9.75854479e-01, 2.41455095e-02, 1.10881450e-08],
[1.18780156e-09, 6.53295166e-04, 9.99346704e-01],
[6.71574900e-03, 8.14174200e-01, 1.79110051e-01],
[6.98756622e-04, 8.09096425e-01, 1.90204818e-01]])
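Note that predict itself just returns the class with the largest predicted probability, so reproducing predict_proba is enough to also reproduce the hard labels. A quick check against the fitted pipeline cls from above:
# predict() returns the class whose predicted probability is largest,
# i.e. the row-wise argmax of predict_proba() mapped through classes_
labels = cls[1].classes_[cls.predict_proba(X_test).argmax(axis=1)]
np.array_equal(cls.predict(X_test), labels)
>>> True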
Then I compute the class-probability predictions for the same observations manually, using the model's coefficients.
# define two functions for making predictions
def logit(x, w):
    return np.dot(x, w)

# from here:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
n, k = X_test.shape
X_ = np.hstack((np.ones((n, 1)), X_test)) # add column with 1 for intercept
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_)) # create weights matrix
results = softmax(logit(X_, weights.T)) # calculate probabilities
results[0:5]
>>>array([[3.67343725e-14, 4.63938438e-06, 9.99995361e-01],
[2.81976786e-05, 8.63083152e-01, 1.36888650e-01],
[1.24572182e-22, 5.47800683e-11, 1.00000000e+00],
[3.32990060e-14, 3.08352323e-06, 9.99996916e-01],
[2.66415118e-15, 1.78252465e-06, 9.99998217e-01]])
If you compare the two results (preds_test[0:5] and results[0:5]), you can see that they do not match at all. Please explain what I am doing wrong, and how to compute predictions from the model's coefficients without using the predict methods.
I forgot to apply the scaler. If the code is changed slightly, the results match.
scaler = StandardScaler()
scaler.fit(X_train)
X_test_transf = scaler.transform(X_test)
def logit(x, w):
    return np.dot(x, w)

def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
n, k = X_test_transf.shape
X_ = np.hstack((np.ones((n, 1)), X_test_transf))
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_))
results = softmax(logit(X_, weights.T))
np.allclose(preds_test, results)
>>>True
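Since the target machine cannot run scikit-learn at all, the standardization can also be folded directly into the weight matrix, so that only the raw features and two small arrays of numbers have to be shipped. A minimal sketch, assuming the fitted pipeline cls and the softmax function defined above:
# StandardScaler computes (x - mean_) / scale_, so
# softmax(((x - mu) / sigma) @ W.T + b) == softmax(x @ W_raw.T + b_raw)
# with W_raw = W / sigma and b_raw = b - W @ (mu / sigma)
mu, sigma = cls[0].mean_, cls[0].scale_
W_raw = cls[1].coef_ / sigma
b_raw = cls[1].intercept_ - cls[1].coef_ @ (mu / sigma)
results_raw = softmax(X_test.to_numpy() @ W_raw.T + b_raw)
np.allclose(preds_test, results_raw)
>>> True
W_raw and b_raw can then be exported as plain numbers and evaluated on the target machine with nothing more than a dot product and a softmax.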
For a binary model, predict_proba returns two values per observation: the first is the probability that the event does not occur, and the second is the probability that it does. predict_proba(X)[:, 1] gives the probability of the event occurring.
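In that binary case the softmax above reduces to a single sigmoid. A minimal sketch, assuming a binary LogisticRegression clf fitted on some two-class data (load_breast_cancer is used here purely as an illustration):
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)
# the probability of the event (class 1) is the sigmoid of the single logit
p1 = 1.0 / (1.0 + np.exp(-(X @ clf.coef_.ravel() + clf.intercept_[0])))
np.allclose(clf.predict_proba(X)[:, 1], p1)  # column 0 holds 1 - p1
>>> True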