SGD classifier Precision-Recall curve

Question

I'm working on a binary classification problem, and I have an SGD classifier like this:

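from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(
    max_iter            = 1000,
    tol                 = 1e-3,
    validation_fraction = 0.2,              # only used when early_stopping=True
    class_weight        = {0: 0.5, 1: 8.99} # up-weight the positive class
)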

I fit it on my training set and plotted the precision-recall curve:

from sklearn.metrics import plot_precision_recall_curve
disp = plot_precision_recall_curve(sgd, X_test, y_test)

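Given that the SGD classifier in scikit-learn uses loss="hinge" by default, how is it possible for this curve to be plotted? My understanding is that the output of the SGD classifier is not probabilistic: it is either 1 or 0. So there are no "thresholds", and yet the sklearn precision-recall curve is a zigzag-shaped plot with a range of different thresholds. What's going on here?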

Answer 1

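The case you describe is practically identical to the one found in the documentation example, which uses the first 2 classes of the iris data and a LinearSVC classifier (this algorithm uses the squared hinge loss, which, like the hinge loss you use here, results in a classifier that produces only binary outcomes and not probabilistic ones). The resulting plot there is qualitatively similar to yours.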
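As a quick check of that setup, here is a minimal sketch (not the documentation example itself) that fits a LinearSVC on the first 2 classes of the iris data and inspects what the fitted model exposes. The variable names and random_state=0 are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# Keep only the first two iris classes (labels 0 and 1) to get a binary problem.
X, y = load_iris(return_X_y=True)
mask = y < 2
X2, y2 = X[mask], y[mask]

svc = LinearSVC(random_state=0).fit(X2, y2)

# LinearSVC has no predict_proba, but it does expose decision_function,
# which returns one continuous signed score per sample.
has_proba = hasattr(svc, "predict_proba")
scores = svc.decision_function(X2)

print("has predict_proba:", has_proba)
print("first five scores:", scores[:5])
```

The scores are real-valued rather than 0/1, which is what makes a threshold sweep possible in the first place.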

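Nevertheless, your question is a legitimate one, and a nice catch indeed: how come we get behavior similar to that of probabilistic classifiers, when our classifier does not produce probabilistic predictions (and hence any notion of a threshold sounds irrelevant)?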

To see why this is so, we need to dig into the scikit-learn source code, starting from the plot_precision_recall_curve function used here and following the thread down the rabbit hole...

Starting from the source code of plot_precision_recall_curve, we find:

y_pred, pos_label = _get_response(
    X, estimator, response_method, pos_label=pos_label)

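So, for the purposes of plotting the PR curve, the predictions y_pred are not produced directly by the predict method of our classifier, but by the _get_response() internal function of scikit-learn.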

_get_response() in turn includes the lines:

prediction_method = _check_classifier_response_method(
    estimator, response_method)

y_pred = prediction_method(X)

This finally leads us to the _check_classifier_response_method() internal function; you can check its full source code, but what is of interest here are the following 3 lines after the else statement:

predict_proba = getattr(estimator, 'predict_proba', None)
decision_function = getattr(estimator, 'decision_function', None)
prediction_method = predict_proba or decision_function
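The fallback in those three lines can be reproduced in isolation. The sketch below is not scikit-learn code: ScoresOnly, WithProba, and pick_response_method are hypothetical names invented here, and the toy estimators only mimic which methods are present.

```python
class ScoresOnly:
    """Hypothetical estimator exposing only decision_function (like a hinge-loss model)."""

    def decision_function(self, X):
        return [x - 0.5 for x in X]


class WithProba(ScoresOnly):
    """Hypothetical estimator that additionally exposes predict_proba."""

    def predict_proba(self, X):
        return [[1 - x, x] for x in X]


def pick_response_method(estimator):
    # Same pattern as the three lines above: a missing attribute becomes None,
    # and `or` falls through to decision_function when predict_proba is absent.
    predict_proba = getattr(estimator, "predict_proba", None)
    decision_function = getattr(estimator, "decision_function", None)
    method = predict_proba or decision_function
    if method is None:
        raise ValueError("estimator has neither predict_proba nor decision_function")
    return method


chosen_a = pick_response_method(ScoresOnly()).__name__
chosen_b = pick_response_method(WithProba()).__name__
print(chosen_a, chosen_b)
```

predict_proba is preferred whenever it exists; decision_function is used only as the fallback.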

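By now, you may have started getting the point: under the hood, plot_precision_recall_curve checks whether a predict_proba() or a decision_function() method is available for the classifier used. If predict_proba() is not available, as in your case of an SGDClassifier with hinge loss (or the documentation example of a LinearSVC classifier with squared hinge loss), it falls back to the decision_function() method in order to compute the y_pred that is subsequently used for plotting the PR (and ROC) curve.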
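The same behavior can be observed directly with the public API. The following is a minimal sketch on synthetic data (make_classification, the sample sizes, and random_state=0 are assumptions chosen for illustration, not your data), using precision_recall_curve rather than the plotting helper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# loss="hinge" is the default, so no probability estimates are available.
sgd = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(X_train, y_train)

has_proba = hasattr(sgd, "predict_proba")

# Continuous signed scores; predict() is just these scores cut at zero.
scores = sgd.decision_function(X_test)
labels = sgd.predict(X_test)
same_as_zero_cut = np.array_equal(labels, (scores > 0).astype(int))

# Sweeping a threshold over the scores yields the points of the PR curve.
precision, recall, thresholds = precision_recall_curve(y_test, scores)

print("has predict_proba:", has_proba)
print("predict() equals (scores > 0):", same_as_zero_cut)
print("number of thresholds:", len(thresholds))
```

The thresholds here are cut-offs on the decision-function scores, not on probabilities; the 1/0 output of predict() corresponds to the single cut-off at zero.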

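The above arguably answers your programming question about how exactly scikit-learn produces the plot and the underlying calculations in such cases. Further theoretical inquiries into whether and why using the decision_function() of a non-probabilistic classifier is indeed a correct and legitimate way to get a PR (or ROC) curve are out of scope for SO, and should be addressed to Cross Validated, if necessary.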
