How to get probabilities along with classification in LogisticRegression?

Question

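I am using the logistic regression algorithm for multi-class text classification. I need a way to get the confidence score along with the predicted category. For example, if I pass text = "Hello this is sample text" to the model, I should get predicted class = Class A and confidence = 80% as a result.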

Answer 1

For most models in scikit-learn, we can get the probability estimates for the classes through predict_proba. Bear in mind that this is the actual output of the logistic function (softmax in the multinomial case); the resulting classification is obtained by selecting the output with the highest probability, i.e. an argmax is applied on the output. If we look at the implementation of predict, we can see that it is essentially doing:

def predict(self, X):
    # decision func on input array
    scores = self.decision_function(X)
    # column indices of max values per row
    indices = scores.argmax(axis=1)
    # index class array using indices
    return self.classes_[indices]

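When predict_proba is called instead of predict, the per-class scores are returned as probabilities, without the argmax step. Here's an example use case, training a LogisticRegression: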

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_prob = lr.predict_proba(X_test)

y_pred_prob
array([[1.06906558e-02, 9.02308167e-01, 8.70011771e-02],
       [2.57953117e-06, 7.88832490e-03, 9.92109096e-01],
       [2.66690975e-05, 6.73454730e-02, 9.32627858e-01],
       [9.88612145e-01, 1.13878133e-02, 4.12714660e-08],
       ...

And we can obtain the predicted classes by taking the argmax of the probabilities and using it to index the array of classes:

classes = load_iris().target_names
# column index of the most probable class in each row
indices = y_pred_prob.argmax(axis=1)
classes[indices]
array(['virginica', 'virginica', 'versicolor', 'virginica', 'setosa',
       'versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',...
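
As a quick sanity check, the relationship between predict, predict_proba and decision_function described above can be verified directly. This is only a minimal sketch, assuming a recent scikit-learn (0.22 or later) where the default lbfgs solver fits a multinomial (softmax) model for a three-class target; random_state and max_iter are additions made here for reproducibility and convergence, not part of the answer's code.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# random_state is an assumption added so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter raised (assumption) to avoid a convergence warning
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

proba = lr.predict_proba(X_test)
indices = proba.argmax(axis=1)

# each row of predict_proba is a probability distribution over the classes
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# indexing classes_ with the argmax reproduces predict
same_as_predict = np.array_equal(lr.classes_[indices], lr.predict(X_test))

# in the multinomial case the probabilities are the softmax of the decision scores
matches_softmax = np.allclose(proba, softmax(lr.decision_function(X_test), axis=1))

print(rows_sum_to_one, same_as_predict, matches_softmax)
```

Note that lr.classes_ holds the integer labels seen during fit, which is why the answer indexes target_names to get the readable names.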

So for a single prediction, we can easily do something like this with the predicted probabilities:

y_pred_prob = lr.predict_proba(X_test[0,None])
ix = y_pred_prob.argmax(1).item()

print(f'predicted class = {classes[ix]} and confidence = {y_pred_prob[0,ix]:.2%}')
# predicted class = virginica and confidence = 90.75%
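
For the text classification case in the question, the same idea should carry over once the text has been vectorized. The following is a minimal sketch, assuming a TF-IDF plus LogisticRegression pipeline; the toy corpus, the labels "Class A" / "Class B" / "Class C", and the helper name predict_with_confidence are all made up for illustration and are not from the question or the answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; a real model needs far more training data
texts = [
    "hello this is a greeting text",
    "hi there, a friendly sample message",
    "invoice payment due next week",
    "please pay the attached invoice",
    "the match ended with a late goal",
    "the team won the football game",
]
labels = ["Class A", "Class A", "Class B", "Class B", "Class C", "Class C"]

# Vectorize the raw text and fit the classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)


def predict_with_confidence(model, text):
    """Return (predicted class, probability of that class) for one text."""
    proba = model.predict_proba([text])[0]  # one row: one probability per class
    ix = proba.argmax()                     # index of the most probable class
    return model.classes_[ix], float(proba[ix])


label, confidence = predict_with_confidence(model, "Hello this is sample text")
print(f"predicted class = {label} and confidence = {confidence:.2%}")
```

The "confidence" here is simply the model's estimated probability for the top class; whether it is well calibrated depends on the data and the amount of regularization.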
