How to calculate the ROC when using 10x10 cross validation?

Question

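This question is related to another one: How to binarize RandomForest to plot a ROC in python? I also used the code available in scikit-learn: ROC multiclass problem

So I want to plot the ROC. But since I am doing 10x10 cross validation, do I have to compute the mean of the probabilities ("predict_proba"), given that I will have 100 y_score arrays? And is each one a 15x3 array?

Check this line in the code: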

y_score = clf.fit(x_train, y_train).predict_proba(x_test)

The code starts here:

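from sklearn import datasets, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, label_binarize

# Import some data to play with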
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

result_list = []  # stores the average of the inner loops - preliminary
yscore_list = []
clf = Pipeline([('rcl', RobustScaler()),
                ('clf', OneVsRestClassifier(RandomForestClassifier(random_state=0, n_jobs=-1)))])

print("4 epochs x subject in test_size", "\n")
xSSSmean84 = []  # 4 epochs x subject => test_size=84 or 0.1
for i in range(1):
    sss = StratifiedShuffleSplit(2, test_size=0.1, random_state=i)
    scoresSSS = model_selection.cross_val_score(clf, X, y, cv=sss)
    xSSSmean84.append(scoresSSS.mean())

    for train_index, test_index in sss.split(X, y):
        x_train, x_test = X[train_index], X[test_index] 
        y_train, y_test = y[train_index], y[test_index]

        y_score = clf.fit(x_train, y_train).predict_proba(x_test) 
        yscore_list.append(y_score)
        print(y_score)
        print("")

This is what y_score looks like. With cross validation I will have many of them:

[[ 0.   1.   0.1]
 [ 0.   0.   1. ]
 [ 0.   1.   0. ]
 [ 0.   0.   1. ]
 [ 1.   0.   0. ]
 [ 0.   0.   1. ]
 [ 0.   0.   1. ]
 [ 0.   1.   0.1]
 [ 0.   1.   0. ]
 [ 1.   0.   0. ]
 [ 0.   0.   1. ]
 [ 1.   0.   0. ]
 [ 1.   0.   0. ]
 [ 1.   0.   0. ]
 [ 0.   1.   0. ]]

Answer 1

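Let's look at what y_score means: each column contains the scores for one class, and each row represents one observation.

You may notice the following for StratifiedShuffleSplit (from the sklearn docs): http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html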

n_splits : int, default 10

    Number of re-shuffling & splitting iterations.

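You set it to 2, so you will only have 2 shuffled splits, each holding out 0.1 of the observations as a test set. Even if the two test sets did not overlap, you would be evaluating your cross-validation results on at most 20% of the original data. Any performance metric you derive from this represents the out-of-sample error only if the 20% drawn by the splits is representative of the remaining 80%. I would suggest starting with a cross-validation strategy that, as a first step, covers the full input dataset.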
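One scheme that does cover the full dataset is repeated stratified k-fold. The sketch below assumes that "10x10" means 10 stratified folds repeated 10 times, and it stratifies on the integer labels rather than on the binarized matrix. Both are assumptions, not something stated in the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = load_iris(return_X_y=True)  # y holds the integer labels 0, 1, 2

# Assumption: "10x10" = 10 stratified folds, repeated 10 times.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
splits = list(rskf.split(X, y))
n_total_splits = len(splits)  # 10 folds * 10 repeats

# Within one repeat, the 10 test folds together contain every sample once.
first_repeat_test = np.concatenate([test for _, test in splits[:10]])
covers_all = np.array_equal(np.sort(first_repeat_test), np.arange(len(y)))

print(n_total_splits, covers_all)
```

Unlike StratifiedShuffleSplit with n_splits=2, every observation is used for testing exactly once per repeat.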

So you are not getting a "10x10" cross validation; instead you get folds of size: n

The full dataset size is:

Classes     3
Samples per class   50
Samples total   150

So when you select 0.1 of the dataset, you get folds of 15 observations, and therefore 15 rows in y_score. With one column per class, that explains why your scores are 15x3.
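The shape can be checked directly. This is a minimal sketch that reuses the split settings from the question but, as a simplification, fits a plain RandomForestClassifier on the integer labels instead of the full pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes

sss = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=0)
shapes = []
for train_index, test_index in sss.split(X, y):
    # Simplification: plain random forest, no scaler or one-vs-rest wrapper.
    clf = RandomForestClassifier(random_state=0)
    y_score = clf.fit(X[train_index], y[train_index]).predict_proba(X[test_index])
    shapes.append(y_score.shape)  # (test observations, classes)

print(shapes)
```

Each entry is (15, 3): 0.1 * 150 = 15 test observations, and one score column per class.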

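To derive the ROC, you need to compute the false positive rate and the true positive rate for each class (ROC is only defined for binary classifiers).

The following code from the link you sent should work (columns <-> classes):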

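from sklearn.metrics import roc_curve, auc

# One ROC curve per class: column i of y_test / y_score belongs to class i
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])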

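Starting from fpr and tpr, you can build a multiclass ROC curve in different ways (micro- or macro-averaging, see the sklearn docs). It is hard to give advice here, because it really depends on your application and the metric you are interested in.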
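To make the two averaging options concrete, here is a hedged sketch for a single train/test split. The split, the plain RandomForestClassifier, and the 101-point FPR grid are illustrative assumptions; the averaging follows the general approach described in the sklearn multiclass ROC example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(x_train, y_train)
y_score = clf.predict_proba(x_test)
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
n_classes = y_test_bin.shape[1]

# Per-class curves (one-vs-rest), as in the loop above.
fpr, tpr = {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])

# Micro-average: pool every (sample, class) decision into one binary problem.
fpr_micro, tpr_micro, _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
auc_micro = auc(fpr_micro, tpr_micro)

# Macro-average: interpolate each class curve on a common FPR grid,
# then take the unweighted mean of the TPRs.
fpr_grid = np.linspace(0.0, 1.0, 101)
mean_tpr = np.zeros_like(fpr_grid)
for i in range(n_classes):
    mean_tpr += np.interp(fpr_grid, fpr[i], tpr[i])
mean_tpr /= n_classes
auc_macro = auc(fpr_grid, mean_tpr)

print(f"micro AUC = {auc_micro:.3f}, macro AUC = {auc_macro:.3f}")
```

Micro-averaging weights every observation equally, while macro-averaging weights every class equally; on a balanced dataset like iris the two are usually close.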

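Still, whichever approach you choose, you will end up with multiple ROC curves, one per stratified fold. You can then compute summary statistics, such as the mean and standard deviation, of the AUC (or TPR, FPR) across the folds. That gives you an estimate of the model's performance and of its stability on unseen data.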
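A rough sketch of that summary step follows. It computes one macro-averaged AUC per fold and then aggregates, rather than averaging the y_score arrays themselves, since each fold scores a different set of test observations. The 10x10 scheme via RepeatedStratifiedKFold, the macro average, and n_estimators=25 (chosen only to keep the run short) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, label_binarize

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Assumption: "10x10" = 10 stratified folds, repeated 10 times.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

fold_aucs = []
for train_index, test_index in rskf.split(X, y):
    clf = make_pipeline(
        RobustScaler(),
        RandomForestClassifier(n_estimators=25, random_state=0),
    )
    y_score = clf.fit(X[train_index], y[train_index]).predict_proba(X[test_index])
    # Binarize only the test labels so they line up with the score columns.
    y_test_bin = label_binarize(y[test_index], classes=classes)
    fold_aucs.append(roc_auc_score(y_test_bin, y_score, average="macro"))

fold_aucs = np.asarray(fold_aucs)
mean_auc, std_auc = fold_aucs.mean(), fold_aucs.std()
print(f"macro AUC over {len(fold_aucs)} folds: {mean_auc:.3f} +/- {std_auc:.3f}")
```

The standard deviation across the 100 folds is what indicates how stable the model is from one split to the next.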
