How to interpret this triangular shape ROC AUC curve?

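I have 10+ features and over ten thousand cases to train a logistic regression that classifies people's ethnicity. The first example is French vs. non-French, and the second is English vs. non-English. The results are as follows: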

//////////////////////////////////////////////////////

1= fr
0= non-fr
Class count:
0    69109
1    30891
dtype: int64
Accuracy: 0.95126
Classification report:
             precision    recall  f1-score   support

          0       0.97      0.96      0.96     34547
          1       0.92      0.93      0.92     15453

avg / total       0.95      0.95      0.95     50000

Confusion matrix:
[[33229  1318]
 [ 1119 14334]]
AUC= 0.944717975754

//////////////////////////////////////////////////////

1= en
0= non-en
Class count:
0    76125
1    23875
dtype: int64
Accuracy: 0.7675
Classification report:
             precision    recall  f1-score   support

          0       0.91      0.78      0.84     38245
          1       0.50      0.74      0.60     11755

avg / total       0.81      0.77      0.78     50000

Confusion matrix:
[[29677  8568]
 [ 3057  8698]]
AUC= 0.757955582999

//////////////////////////////////////////////////////

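However, I get some very strange-looking ROC curves: they are triangular instead of the usual jagged, rounded curve. Is there any explanation for why I get this shape? Have I made a mistake somewhere?

Code: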

    all_dict = []
    for i in range(0, len(my_dict)):
        temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
            + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
            + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
            + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
            )
        all_dict.append(temp_dict)

    newX = dv.fit_transform(all_dict)

    # Separate the training and testing data sets
    half_cut = int(len(df)/2.0)*-1
    X_train = newX[:half_cut]
    X_test = newX[half_cut:]
    y_train = y[:half_cut]
    y_test = y[half_cut:]

    # Fitting X and y into model, using training data
    #$$
    lr.fit(X_train, y_train)

    # Making predictions using trained data
    #$$
    y_train_predictions = lr.predict(X_train)
    #$$
    y_test_predictions = lr.predict(X_test)

    #print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
    print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])

    print 'Classification report:'
    print classification_report(y_test, y_test_predictions)
    #print sk_confusion_matrix(y_train, y_train_predictions)
    print 'Confusion matrix:'
    print sk_confusion_matrix(y_test, y_test_predictions)

    #print y_test[1:20]
    #print y_test_predictions[1:20]

    #print y_test[1:10]
    #print np.bincount(y_test)
    #print np.bincount(y_test_predictions)

    # Find and plot AUC
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print 'AUC=',roc_auc

    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

You are doing it wrong. According to the documentation:

y_score : array, shape = [n_samples]

    Target scores, can either be probability estimates of the positive class or confidence values.

Thus at this line:

    roc_curve(y_test, y_test_predictions)

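you should pass the result of decision_function (or one of the two columns of the predict_proba result) to roc_curve, instead of the actual predictions. Hard 0/1 predictions give roc_curve only one useful threshold, so the curve consists of just three points, (0, 0), (FPR, TPR) and (1, 1), which is the triangle you see.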
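A minimal sketch of that change, written for Python 3 and a recent scikit-learn. The synthetic data from `make_classification` and the `max_iter` setting are assumptions made for illustration, not the asker's data or settings; only the second argument to `roc_curve` matters here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# Synthetic stand-in for the real features and labels (assumption).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
half = len(y) // 2
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# What the question does: hard 0/1 labels, so only three ROC points.
fpr_hard, tpr_hard, _ = roc_curve(y_test, lr.predict(X_test))

# Option 1: probability of the positive class (column 1 of predict_proba).
y_proba = lr.predict_proba(X_test)[:, 1]
fpr_score, tpr_score, _ = roc_curve(y_test, y_proba)

# Option 2: signed distance to the decision boundary.
y_dec = lr.decision_function(X_test)
fpr_dec, tpr_dec, _ = roc_curve(y_test, y_dec)

print('points with hard labels:', len(fpr_hard))
print('points with scores:', len(fpr_score))
print('AUC with scores:', auc(fpr_score, tpr_score))
```

In the question's code, the equivalent change would be to compute `lr.predict_proba(X_test)[:, 1]` and pass it to `roc_curve` in place of `y_test_predictions`; the plotting code can stay as it is. Both options rank the test samples in the same order, so they should give the same curve.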

Look at these examples: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
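The numbers posted in the question are consistent with this diagnosis. When `roc_curve` receives hard labels, the curve is the polyline (0, 0), (FPR, TPR), (1, 1), whose area is (1 + TPR - FPR) / 2. The sketch below recomputes that area from the two confusion matrices in the question, assuming scikit-learn's usual `[[TN, FP], [FN, TP]]` layout; `three_point_auc` is a helper name made up for this example.

```python
def three_point_auc(tn, fp, fn, tp):
    """Area under the polyline (0, 0) -> (FPR, TPR) -> (1, 1)."""
    tpr = tp / (tp + fn)  # recall of the positive class
    fpr = fp / (fp + tn)  # share of negatives predicted positive
    return (1.0 + tpr - fpr) / 2.0

# Confusion matrices copied from the question's output.
auc_fr = three_point_auc(tn=33229, fp=1318, fn=1119, tp=14334)
auc_en = three_point_auc(tn=29677, fp=8568, fn=3057, tp=8698)

print('fr:', auc_fr)  # agrees with the reported AUC of 0.9447...
print('en:', auc_en)  # agrees with the reported AUC of 0.7579...
```

Since both values agree with the printed AUCs, the reported figures describe a single operating point and not the full ranking quality of the model.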