Doubts in accuracy assessment of binary classification

Question

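I am using a Keras Sequential() model for binary classification, and I have some doubts about how to assess its accuracy.

I am computing the AUC-ROC for it. Should I use the predicted probabilities or the predicted classes for this?

Explanation:

After training the model, I call model.predict() to get the predicted values for the training and validation data (code below).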

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_pred_train = model.predict(x_train_df).ravel()
y_pred_val = model.predict(x_val_df).ravel()

fpr_train, tpr_train, thresholds_roc_train = roc_curve(y_train_df, y_pred_train, pos_label=None)
fpr_val, tpr_val, thresholds_roc_val = roc_curve(y_val_df, y_pred_val, pos_label=None)

roc_auc_train = auc(fpr_train, tpr_train)
roc_auc_val = auc(fpr_val, tpr_val)

plt.figure()
lw = 2
plt.plot(fpr_train, tpr_train, color='darkgreen',lw=lw, label='ROC curve Training (area = %0.2f)' % roc_auc_train)
plt.plot(fpr_val, tpr_val, color='darkorange',lw=lw, label='ROC curve Validation (area = %0.2f)' % roc_auc_val)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--',label='Base line')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

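The resulting plot is shown here. The training and validation accuracy are 0.76 and 0.76, respectively.

model.predict() returns probabilities rather than the actual predicted classes, so I changed the first two lines of the code above to produce classes: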

y_pred_train = (model.predict(x_train_df).ravel() > 0.5).astype("int32")
y_pred_val = (model.predict(x_test_df).ravel() > 0.5).astype("int32")

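So this now computes the AUC-ROC from the class values (I guess). But the accuracy I get is very different and much lower: the training and validation accuracy are 0.66 and 0.46, respectively (plot).

Which of these two approaches is correct, and why do the accuracies differ?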

Answer 1

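A ROC curve is normally created by plotting the sensitivity (TPR) against the false positive rate (FPR, i.e. 1 - specificity) while the classification threshold varies from 0 to 1.0, so it should be built from the predicted probabilities, not from classes that have already been thresholded. See the example here: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc Some pseudocode to get you started: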

import numpy as np

pred_proba = model.predict(x_train_df).ravel()
# Assumes the truth array holds 0/1 labels and contains both classes
truth = np.asarray(y_train_df).ravel().astype(bool)

for thresh in np.arange(0, 1, 0.1):
    pred = pred_proba > thresh

    # Confusion-matrix counts at the current threshold
    tp = np.count_nonzero(truth & pred)
    fp = np.count_nonzero(~truth & pred)
    fn = np.count_nonzero(truth & ~pred)
    tn = np.count_nonzero(~truth & ~pred)

    # Sensitivity (TPR) and false positive rate (FPR = 1 - specificity)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Now you can plot the (fpr, tpr) point for the current threshold value
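To see why the two approaches in the question give different numbers, here is a minimal sketch in pure Python. It does not use the asker's model: the labels and scores are synthetic, and roc_auc is a hypothetical helper based on the pairwise (rank) definition of AUC, which should agree with the trapezoidal area that roc_curve plus auc compute. Thresholding at 0.5 collapses the scores to two values, so the ROC curve shrinks to a single operating point joined to the corners (0, 0) and (1, 1), and the area under it generally differs from (and is usually smaller than) the area obtained from the probabilities.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative, counting ties as one half (hypothetical helper)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


random.seed(0)

# Synthetic data (an assumption for illustration only):
# positives tend to receive higher scores than negatives.
labels = [random.randint(0, 1) for _ in range(500)]
scores = [
    min(1.0, max(0.0, random.gauss(0.6 if y == 1 else 0.4, 0.2)))
    for y in labels
]

# AUC from the continuous scores: every threshold contributes to the curve.
auc_proba = roc_auc(labels, scores)

# AUC from hard 0/1 predictions: only the 0.5 threshold is represented.
hard = [1 if s > 0.5 else 0 for s in scores]
auc_hard = roc_auc(labels, hard)

print("AUC from probabilities: %.3f" % auc_proba)
print("AUC from 0/1 classes:   %.3f" % auc_hard)
```

So for AUC-ROC, pass the probabilities from model.predict() to roc_curve; the thresholded classes are what you would use for metrics such as plain accuracy.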
