Does this ROC curve make sense?

Question

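This code returns and plots the true positive rate, false positive rate, true positive count, and false positive count from the predicted and true values: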

import matplotlib.pyplot as plt

def get_all_stats(y_true, y_pred):

    def perf_measure(y_true, y_pred):
        # Confusion-matrix counts for binary labels (1 = positive, 0 = negative)
        TP = 0
        FP = 0
        TN = 0
        FN = 0

        for i in range(len(y_true)):
            if y_true[i] == 1 and y_pred[i] == 1:
                TP += 1
            if y_pred[i] == 1 and y_true[i] != y_pred[i]:
                FP += 1
            if y_true[i] == 0 and y_pred[i] == 0:
                TN += 1
            if y_pred[i] == 0 and y_true[i] != y_pred[i]:
                FN += 1

        # Report a rate of 0 when its numerator is 0 (also avoids dividing by zero)
        if FP == 0:
            FPR = 0
        else:
            FPR = FP / (FP + TN)

        if TP == 0:
            TPR = 0
        else:
            TPR = TP / (TP + FN)

        return TN, FPR, FN, TPR, TP, FP

    tn, fpr, fn, tpr, tp, fp = perf_measure(y_true, y_pred)

    return tpr, fpr, tp, fp

tpr1 , fpr1 , tp1 , fp1 = get_all_stats(y_true=[1,1,1] , y_pred=[1,0,0])
tpr2 , fpr2 , tp2 , fp2 = get_all_stats(y_true=[1,0,1] , y_pred=[0,1,0])
tpr3 , fpr3 , tp3 , fp3 = get_all_stats(y_true=[0,0,0] , y_pred=[1,0,0])

plt.figure(figsize=(12,6))
plt.tick_params(labelsize=12)

print(tpr1 , fpr1 , tp1 , fp1)
print(tpr2 , fpr2 , tp2 , fp2)
print(tpr3 , fpr3 , tp3 , fp3)

plt.plot([fpr1,fpr2,fpr3], [tpr1 , tpr2, tpr3], color='blue', label='')
plt.ylabel("TPR",fontsize=16)
plt.xlabel("FPR",fontsize=16)
plt.legend()

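The resulting ROC plot is:

To simulate three different false positive and true positive rates at different thresholds, these values are computed by calling the function get_all_stats three times with different inputs: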

tpr1 , fpr1 , tp1 , fp1 = get_all_stats(y_true=[1,1,1] , y_pred=[1,0,0])
tpr2 , fpr2 , tp2 , fp2 = get_all_stats(y_true=[1,0,1] , y_pred=[0,1,0])
tpr3 , fpr3 , tp3 , fp3 = get_all_stats(y_true=[0,0,0] , y_pred=[1,0,0])

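There are 9 instances to be classified as 1 or 0, and their true values are: [1,1,1,1,0,1,0,0,0]

At threshold 1, the predicted values are [1,0,0], and the true values at this threshold are [1,1,1].

At threshold 2, the predicted values are [0,1,0], and the true values at this threshold are [1,0,1].

At threshold 3, the predicted values are [1,0,0], and the true values at this threshold are [0,0,0].

As can be seen, the plot generated for this classifier differs from a 'typical' ROC curve:

It first descends, and then both the false positive rate and the true positive rate decrease, causing the line to 'move back'. Have I implemented the ROC curve correctly? Can the AUC be computed for this curve?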

Answer 1

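OK, motivated to help because you have lots of rep -> helped lots of others. Here we go.

This ROC curve does not make sense. The problem is that you are computing FPR/TPR on only a subset of the data at each threshold. At every threshold, you should use all of the data to compute FPR and TPR. So your plot appears to have 3 points, but it should have only one: the FPR/TPR for y_true = [1,1,1,1,0,1,0,0,0] and y_pred = [1,0,0,0,1,0,1,0,0]. To get an actual ROC curve, though, you also can't just make up y_pred values at different thresholds. They need to come from actual predicted probabilities, which are then thresholded appropriately. I modified your code a bit because I like using numpy; here is how to compute a ROC curve.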
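As a quick check of the single-point claim before the full curve, here is a minimal sketch (not part of the original code) that concatenates the three subsets from the question and computes one FPR/TPR pair over all nine instances. The variable names are illustrative.

```python
import numpy as np

# All nine instances at once: the three subsets from the question, concatenated
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0])

# Confusion-matrix counts over the full data set
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

tpr = tp / (tp + fn)  # fraction of actual 1s predicted as 1
fpr = fp / (fp + tn)  # fraction of actual 0s predicted as 1

print(tpr, fpr)
```

With these labels there is 1 true positive out of 5 actual positives and 2 false positives out of 4 actual negatives, so the whole data set contributes a single point at FPR = 0.5, TPR = 0.2. The threshold sweep that produces a full curve follows.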

import numpy as np
import matplotlib.pyplot as plt

# start with the true labels, as you did
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])
# and a predicted probability of each being a "1"
# I just used random numbers for these, but you would get them
# from your classifier
predictions = np.array([
    0.07485627, 0.72546085, 0.60287482,
    0.90537829, 0.75789236, 0.01852192,
    0.85425979, 0.36881312, 0.63893516
])

# now define a set of thresholds (the more thresholds, the better
# the curve will look). There's a smarter way to do this in practice
# (you can sort the predicted probabilities and just have one threshold
# between each), but this is just to help with understanding
thresholds = np.linspace(0, 1, 11)  # 0.0, 0.1, ..., 1.0

fprs = []
tprs = []

# we can precompute which inputs are actually 1s/0s and how many of each
true_1_idx = np.where(y_true == 1)[0]
true_0_idx = np.where(y_true == 0)[0]
n_true_1 = len(true_1_idx)
n_true_0 = len(true_0_idx)

for threshold in thresholds:
    # now, for each threshold, we use that on the underlying probabilities
    # to get the actual predicted classes
    pred_classes = predictions >= threshold
    # and compute FPR/TPR from those
    tprs.append((pred_classes[true_1_idx] == 1).sum() / n_true_1)
    fprs.append((pred_classes[true_0_idx] == 1).sum() / n_true_0)

plt.figure(figsize=(12,6))
plt.tick_params(labelsize=12)

plt.plot(fprs, tprs, color='blue')
plt.ylabel("TPR",fontsize=16)
plt.xlabel("FPR",fontsize=16)

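Note that TPR (y-axis) on a ROC curve never decreases as FPR (x-axis) increases; that is, the curve goes up or stays flat as you move to the right. This follows from how thresholding works. At a threshold of 0, every prediction is "1", so FPR = TPR = 1. Raising the threshold yields fewer "1" predictions, so FPR and TPR can only stay the same or decrease.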
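The "smarter way" of choosing thresholds mentioned in the code comments can be sketched as follows: use one threshold per distinct predicted score, plus one above the maximum so the curve starts at (0, 0). This is only an illustration under that assumption; the helper name roc_points is made up for this example.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fprs, tprs) using one threshold per distinct score, highest first."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # np.unique returns sorted values; reverse them and prepend +inf so the
    # first point predicts nothing as "1" and the last predicts everything as "1"
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    is_pos = y_true == 1
    is_neg = y_true == 0
    fprs, tprs = [], []
    for threshold in thresholds:
        pred_classes = scores >= threshold
        tprs.append((pred_classes & is_pos).sum() / is_pos.sum())
        fprs.append((pred_classes & is_neg).sum() / is_neg.sum())
    return np.array(fprs), np.array(tprs)

y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])
predictions = np.array([
    0.07485627, 0.72546085, 0.60287482,
    0.90537829, 0.75789236, 0.01852192,
    0.85425979, 0.36881312, 0.63893516
])

fprs, tprs = roc_points(y_true, predictions)

# Lowering the threshold can only add "1" predictions, so neither rate ever drops
monotone = bool(np.all(np.diff(fprs) >= 0) and np.all(np.diff(tprs) >= 0))
print(monotone)
```

Because the thresholds run from high to low here, the points come out ordered from (0, 0) to (1, 1), and each step moves either right or up, never back.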

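Note that even with the optimal thresholds there would still be jumps in the curve: we have a finite amount of data, so any choice of thresholds can give only a finite number of distinct TPR/FPR pairs. With enough data, though, the curve starts to look smooth. Here I replaced a few lines in the code above to get a smoother plot: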

n_points = 1000
y_true = np.random.randint(0, 2, size=n_points)
predictions = np.random.random(n_points)

thresholds = np.linspace(0, 1, 1000)

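In case it's not clear, an AUC of 0.5 is the worst possible, and you can see that is what we get with random "predictions". If your AUC is below 0.5, you can flip every prediction to get better than 0.5 (and something is probably wrong with your model/training).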
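To make the flipping remark concrete, here is a small sketch that computes the AUC by a different route than the threshold sweep above: the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counted as half. This pair-counting value equals the area under the ROC curve. The helper name auc_by_pairs is hypothetical, and the seed is arbitrary.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])
predictions = np.array([
    0.07485627, 0.72546085, 0.60287482,
    0.90537829, 0.75789236, 0.01852192,
    0.85425979, 0.36881312, 0.63893516
])

auc_original = auc_by_pairs(y_true, predictions)
# Flipping every score reverses the ranking, which turns the AUC into 1 - AUC
auc_flipped = auc_by_pairs(y_true, 1 - predictions)

# Random scores carry no information about the labels, so the AUC lands near 0.5
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 2, size=1000)
random_scores = rng.random(1000)
auc_random = auc_by_pairs(random_labels, random_scores)

print(auc_original, auc_flipped, auc_random)
```

For the nine made-up scores used earlier, 7 of the 20 positive/negative pairs are ranked correctly, so the AUC is 0.35 and flipping the scores gives 0.65.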

If you actually want to plot ROC curves in practice, rather than writing the code yourself to learn more, use sklearn's roc_curve. They also have roc_auc_score to get the AUC for you.
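A minimal usage sketch of those two functions, assuming scikit-learn is installed and reusing the made-up scores from earlier in this answer:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])
predictions = np.array([
    0.07485627, 0.72546085, 0.60287482,
    0.90537829, 0.75789236, 0.01852192,
    0.85425979, 0.36881312, 0.63893516
])

# roc_curve picks the thresholds itself and returns the points in order of increasing FPR
fpr, tpr, thresholds = roc_curve(y_true, predictions)

# roc_auc_score takes the labels and the scores directly, not the curve points
auc_value = roc_auc_score(y_true, predictions)

print(fpr, tpr, auc_value)
```

The returned fpr and tpr arrays can be passed straight to plt.plot(fpr, tpr). For these scores the AUC works out to 0.35, which is below 0.5 only because the scores were random numbers on a tiny sample.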
