Sklearn precision recall curve pos_label for unbalanced dataset which class probability to use

Question

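I want to use a precision-recall score to evaluate my model because my data is imbalanced. Since this is a binary classification problem, I use a softmax at the end of my NN. The output scores and true labels look like this: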

y_score = [[0.4, 0.6],
           [0.6, 0.4],
           [0.3, 0.7],
              ...   ]
y_true = [1,
          0,
          0,
         ...]

where y_score[:, 0] holds the probability of class 0.
In my case the positive label is 0, so the negative label is 1.

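Since my dataset is imbalanced (more negatives than positives), I want to use the area under the precision-recall curve (AUPRC) to evaluate my classifier. The function sklearn.metrics.precision_recall_curve takes a parameter pos_label, which I would set to pos_label = 0. But the parameter probas_pred takes an ndarray of probabilities of shape (n_samples,).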

My question is: since I set pos_label = 0, which column of my y_score should I pass as probas_pred?

I hope my question is clear.
Thanks in advance!

Answer 1

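It should be the first column in the example above, i.e. the column that holds the probability of the class you pass as pos_label. You can confirm this as follows.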

Using an example dataset:

import matplotlib.pyplot as plt  # needed for the plot at the end
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_recall_curve

# Imbalanced data: 400 samples of class 0 (the minority), 2000 samples of class 1
X, y = make_blobs(n_samples=[400, 2000], centers=None, n_features=5, random_state=999, cluster_std=5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111)

Train the classifier:

clf = MLPClassifier(hidden_layer_sizes=(3, 3), random_state=999)
clf.fit(X_train, y_train)

Check the classes:

clf.classes_
array([0, 1])
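The columns of predict_proba follow the order of clf.classes_, so the column for a given pos_label can be looked up instead of hard-coded. Here is a minimal sketch of that lookup using the y_score rows from the question; the helper name proba_column_for is hypothetical, not part of scikit-learn.

```python
import numpy as np

def proba_column_for(classes, pos_label):
    """Return the position of pos_label in a classes_-style array (hypothetical helper)."""
    matches = np.flatnonzero(np.asarray(classes) == pos_label)
    if matches.size != 1:
        raise ValueError(f"pos_label={pos_label!r} not found exactly once in {classes!r}")
    return int(matches[0])

# Same layout as in the question: column 0 is P(class 0), column 1 is P(class 1).
classes = np.array([0, 1])
y_score = np.array([[0.4, 0.6],
                    [0.6, 0.4],
                    [0.3, 0.7]])

col = proba_column_for(classes, pos_label=0)
pos_scores = y_score[:, col]  # 1-D scores for the positive class
print(col, pos_scores)
```

With a fitted classifier, the same call would take clf.classes_ and clf.predict_proba(X_test) in place of the toy arrays.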

You can put the predicted probabilities and the actual labels into a dataframe to see that the columns line up:

            0         1  actual
0    0.999734  0.000266       0
1    0.001253  0.998747       1
2    0.000137  0.999863       1
3    0.000113  0.999887       1
4    0.003173  0.996827       1
..        ...       ...     ...
475  0.014316  0.985684       1
476  0.012767  0.987233       1
477  0.062735  0.937265       1
478  0.000048  0.999952       1
479  0.999733  0.000267       0
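The answer does not show how this table was built. One plausible way, sketched here with the setup from the earlier steps repeated so the snippet runs on its own, is to wrap predict_proba in a DataFrame and append the true labels. The variable name proba_df is an assumption.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Same data and model settings as in the steps above.
X, y = make_blobs(n_samples=[400, 2000], centers=None, n_features=5,
                  random_state=999, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111)
clf = MLPClassifier(hidden_layer_sizes=(3, 3), random_state=999)
clf.fit(X_train, y_train)

# One column per class, labelled by clf.classes_, plus the true label.
proba_df = pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_)
proba_df["actual"] = y_test
print(proba_df)
```

Rows where "actual" is 0 should mostly show a high value in column 0, which is what makes column 0 the right score for pos_label=0.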

Then compute the curve:

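# Column 0 is P(class 0), matching pos_label=0. The scores are passed positionally
# because the keyword is probas_pred in older scikit-learn releases and y_score in newer ones.
prec, recall, thres = precision_recall_curve(y_test, clf.predict_proba(X_test)[:, 0], pos_label=0)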
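Since the question asks for AUPRC, the curve can be reduced to a single number. A minimal sketch follows, using made-up labels and class-0 probabilities rather than the model above; average_precision_score also accepts pos_label, and it matches the step-wise sum over the curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels and class-0 probabilities, for illustration only.
y_true = np.array([1, 0, 0, 1, 1, 1, 0, 1])
p_class0 = np.array([0.2, 0.9, 0.6, 0.7, 0.1, 0.3, 0.8, 0.4])

# pos_label=0 marks class 0 as the positive class; scores are P(class 0).
prec, recall, thres = precision_recall_curve(y_true, p_class0, pos_label=0)

# Average precision: precision weighted by each increase in recall.
# recall is returned in decreasing order, hence the minus sign.
ap_from_curve = -np.sum(np.diff(recall) * prec[:-1])
ap = average_precision_score(y_true, p_class0, pos_label=0)
print(ap_from_curve, ap)
```

sklearn.metrics.auc(recall, prec) is another option; it uses trapezoidal interpolation and so can give a slightly different value than average precision.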

Plot it. If you had passed the wrong column, the curve would look really odd; with the first column it looks as expected:

plt.plot(recall, prec)  # conventional layout: recall on the x-axis, precision on the y-axis
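A numeric check can complement the plot. The sketch below, on made-up data, scores the same labels once with the class-0 column and once with the class-1 column while keeping pos_label=0. With the wrong column, the average precision drops below the positive-class prevalence, which is roughly what a random ranking would give.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Made-up labels and probabilities, for illustration only.
y_true = np.array([1, 0, 0, 1, 1, 1, 0, 1])
p_class0 = np.array([0.2, 0.9, 0.6, 0.7, 0.1, 0.3, 0.8, 0.4])
proba = np.column_stack([p_class0, 1 - p_class0])  # columns ordered like classes_ = [0, 1]

ap_correct = average_precision_score(y_true, proba[:, 0], pos_label=0)  # matching column
ap_flipped = average_precision_score(y_true, proba[:, 1], pos_label=0)  # wrong column
prevalence = np.mean(y_true == 0)  # share of positives (class 0)

print(ap_correct, ap_flipped, prevalence)
```

An average precision well below the prevalence is a useful warning sign that the score column and pos_label do not refer to the same class.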

python

scikit-learn

precision-recall