sklearn ROC curve

Question

I have 10 classes, and my y_test has shape (1000, 10). It looks like this:

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

If I use the following, where i is the class index:

fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])

should y_pred be

y_pred = model.predict(x_test)

OR

y_pred = np.argmax(model.predict(x_test), axis=1)
lb = LabelBinarizer()
lb.fit(y_test)
y_pred = lb.transform(y_pred)

The first option gives:

[[6.87280996e-11 6.28617670e-07 9.96915460e-01 ... 3.08361766e-03
  3.47333212e-14 2.83545876e-09]
 [7.04240659e-30 1.51786850e-07 8.49807921e-28 ... 6.62584656e-33
  6.97696034e-19 1.01019222e-20]
 [2.97537670e-14 2.67199534e-24 2.85646610e-19 ... 2.19898160e-15
  7.03626012e-22 7.56072279e-18]
 ...
 [1.63774752e-15 1.32784101e-06 1.23182635e-05 ... 3.60217566e-14
  6.01247484e-05 2.61179358e-01]
 [2.09420733e-35 6.94865276e-10 1.14242395e-22 ... 5.08080394e-22
  1.20934697e-19 1.77760468e-17]
 [1.68334747e-13 8.53335252e-04 4.40571597e-07 ... 1.70050384e-06
  1.48684137e-06 2.93400045e-03]]

with shape (1000, 10),

while the second option gives:

[[0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

also with shape (1000, 10).

Which is the correct approach? In other words, what should y_pred be when it is passed to sklearn.metrics.roc_curve()?

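I forgot to mention: the first option gives extremely high (almost 1) AUC values for all classes, whereas the second option seems to produce reasonable AUC values.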

The ROC curves produced by the two options are shown below. Which one looks more correct?

Answer 1

The first option is the right one. It is what the documentation asks for:

y_score : ndarray of shape (n_samples,)

Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).

Also, the first plot looks like an ROC curve, whereas the second one looks odd.

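Finally, an ROC curve is meant to study different classification thresholds. That means you need the predictions as probabilities (confidences), not as 0s and 1s.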
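As a rough illustration of the threshold sweep, here is a minimal sketch on a toy binary problem. The four labels and scores are made-up values chosen for the example, not data from the question.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy data (assumed for illustration): true labels and continuous scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve tries each distinct score as a decision threshold and
# records the (FPR, TPR) pair obtained at that threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  fpr={f:.2f}  tpr={t:.2f}")

# Area under the resulting curve; for this toy data it is 0.75.
print("AUC:", auc(fpr, tpr))
```

Each row printed corresponds to one point on the curve, so the more distinct score values there are, the more finely the curve is traced.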

When you take the argmax, you throw away the probabilities/confidences, so there are no thresholds left to study.
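The effect can be seen in a small sketch that mimics the question's setup. Everything here is an assumption made for illustration: the labels are random, and `probs` is a synthetic softmax output standing in for `model.predict(x_test)`, not the asker's real model.

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_samples, n_classes = 1000, 10
classes = np.arange(n_classes)

# Synthetic one-hot ground truth with the same shape as in the question.
labels = rng.integers(0, n_classes, size=n_samples)
y_test = label_binarize(labels, classes=classes)  # (1000, 10)

# Stand-in for model.predict(x_test): noisy logits through a softmax.
logits = rng.normal(size=(n_samples, n_classes)) + 2.0 * y_test
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Second option from the question: argmax, then binarize again.
hard = label_binarize(probs.argmax(axis=1), classes=classes)

i = 2  # any class index works
fpr_soft, tpr_soft, thr_soft = roc_curve(y_test[:, i], probs[:, i])
fpr_hard, tpr_hard, thr_hard = roc_curve(y_test[:, i], hard[:, i])

# Probabilities give many thresholds; 0/1 predictions give at most three
# points, because the only distinct score values are 0 and 1.
print("points with probabilities:", len(thr_soft))
print("points with argmax labels:", len(thr_hard))
```

With 0/1 predictions the "curve" is just two straight segments through a single operating point, which is likely why the second plot looks odd.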
