Compare whether the difference between performance accuracy of 2 ML models is Statistically Significant or Not

Question

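This is my first time using Stack Exchange, but I need help with a problem (it is not a homework or assignment problem):

I have two decision trees: D1 = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=1) and D2 = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1). When I ran 5-fold cross-validation on both of them for a given set of features and corresponding labels, I found their average validation accuracy over the 5 folds to be 0.59 and 0.57 respectively. How do I determine whether the difference between their performances is statistically significant? (P.S. we are to use a significance level of 0.01.)

Please say so if any information or important term is missing here.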

Answer 1

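This is a very good question, and the answer turns out not to be that straightforward.

Instinctively, most people would recommend Student's paired t-test; but, as explained in the excellent post Statistical Significance Tests for Comparing Machine Learning Algorithms at Machine Learning Mastery, this test is not actually suited to this case, because its assumptions are in fact violated: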

In fact, this [Student's t-test] is a common way to compare classifiers with perhaps hundreds of published papers using this methodology.

The problem is, a key assumption of the paired Student’s t-test has been violated.

Namely, the observations in each sample are not independent. As part of the k-fold cross-validation procedure, a given observation will be used in the training dataset (k-1) times. This means that the estimated skill scores are dependent, not independent, and in turn that the calculation of the t-statistic in the test will be misleadingly wrong along with any interpretations of the statistic and p-value.
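The dependence described in the quote can be seen by counting how often each observation lands in a training set. Here is a minimal sketch in plain Python, assuming 20 observations split into 5 contiguous folds (both numbers are illustrative and not taken from the question):

```python
from collections import Counter

n_samples, k = 20, 5  # illustrative sizes (assumption, not from the question)
fold_size = n_samples // k
indices = list(range(n_samples))

train_counts = Counter()
train_sets = []
for fold in range(k):
    # Validation fold: one contiguous block of indices
    val = set(indices[fold * fold_size:(fold + 1) * fold_size])
    # Training set: everything outside the validation block
    train = [i for i in indices if i not in val]
    train_sets.append(set(train))
    train_counts.update(train)

# Distinct values of "number of folds in which an observation was trained on";
# every observation is used for training in k - 1 of the k folds.
times_in_training = set(train_counts.values())

# Number of observations shared by the first two training sets; the heavy
# overlap is why the per-fold scores are not independent samples.
shared = len(train_sets[0] & train_sets[1])
print(times_in_training, shared)
```

With shuffled folds the exact overlap pattern changes, but each observation is still trained on in k - 1 folds.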

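The article goes on to recommend McNemar's test (see also this, now closed, SO question), which is implemented in the statsmodels Python package. I will not pretend to know anything about it and I have never used it, so you may need to do some further digging yourself here...

Nevertheless, as the aforementioned post reports, Student's t-test can be a "last resort" approach: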
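As a rough illustration of what McNemar's test looks at, here is a sketch of its exact (binomial) form in plain Python. It assumes both classifiers were evaluated on the same held-out examples. The helper `mcnemar_exact` and the toy label lists are hypothetical, made up for this example. In practice you would probably reach for `mcnemar` in `statsmodels.stats.contingency_tables` instead.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions (hypothetical helper)."""
    # b: examples model A gets right and model B gets wrong
    b = sum(t == a and t != p for t, a, p in zip(y_true, pred_a, pred_b))
    # c: examples model A gets wrong and model B gets right
    c = sum(t != a and t == p for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        # The models never disagree on correctness: no evidence of a difference
        return b, c, 1.0
    # Under the null hypothesis each disagreement favours A or B with probability 0.5
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data (assumption): true labels and the predictions of two classifiers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
pred_b = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Only the examples on which the two models disagree enter the test, so it needs the individual predictions and not just the per-fold accuracies.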

It’s an option, but it’s very weakly recommended.

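And this is what I am going to demonstrate here; use it with caution.

To start with, you will need not only the averages but also the actual values of your performance metric in each of the k folds of your cross-validation. This is not exactly trivial in scikit-learn, but I recently answered a related question, and I will adapt that answer here using scikit-learn's Boston dataset and two decision tree regressors (you can certainly adapt these to your own exact case):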

from sklearn.model_selection import KFold
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)  # no random_state, so folds differ between runs
# note: criterion='mae' was renamed 'absolute_error' in scikit-learn 1.0 and removed in 1.2
model_1 = DecisionTreeRegressor(max_depth=4, criterion='mae', random_state=1)
model_2 = DecisionTreeRegressor(max_depth=8, criterion='mae', random_state=1)

cv_mae_1 = []
cv_mae_2 = []

for train_index, val_index in kf.split(X):
    model_1.fit(X[train_index], y[train_index])
    pred_1 = model_1.predict(X[val_index])
    err_1 = mean_absolute_error(y[val_index], pred_1)
    cv_mae_1.append(err_1)

    model_2.fit(X[train_index], y[train_index])
    pred_2 = model_2.predict(X[val_index])
    err_2 = mean_absolute_error(y[val_index], pred_2)
    cv_mae_2.append(err_2)

cv_mae_1 contains the values of our metric (here mean absolute error - MAE) for each of the 5 folds of our first model:

cv_mae_1
# result:
[3.080392156862745,
 2.8262376237623767,
 3.164851485148514,
 3.5514851485148515,
 3.162376237623762]

and similarly cv_mae_2 for our second model:

cv_mae_2
# result
[3.1460784313725494,
 3.288613861386139,
 3.462871287128713,
 3.143069306930693,
 3.2490099009900986]
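For the two classifiers in the question, the per-fold accuracies can be collected in much the same way. The sketch below uses `cross_val_score` with one shared `KFold` object so that both trees are scored on identical splits. The synthetic data from `make_classification`, the `random_state=0` values, and the names `X_clf`, `y_clf`, `kf_clf`, `cv_acc_1` and `cv_acc_2` are assumptions standing in for your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with your own features and labels
X_clf, y_clf = make_classification(n_samples=300, n_features=10, random_state=0)

# A fixed random_state makes the splitter yield the same folds each time it is
# used, so both models are evaluated on identical train/validation splits
kf_clf = KFold(n_splits=5, shuffle=True, random_state=0)

D1 = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=1)
D2 = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# One accuracy value per fold for each model
cv_acc_1 = cross_val_score(D1, X_clf, y_clf, cv=kf_clf, scoring='accuracy')
cv_acc_2 = cross_val_score(D2, X_clf, y_clf, cv=kf_clf, scoring='accuracy')
print(cv_acc_1, cv_acc_2)
```

The two resulting arrays are paired fold by fold, which is what a paired test needs.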

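Having obtained these lists, it is now straightforward to calculate the paired t-test statistic along with the corresponding p-value, using the respective method of scipy: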

from scipy import stats
stats.ttest_rel(cv_mae_1,cv_mae_2)
# Ttest_relResult(statistic=-0.6875659723031529, pvalue=0.5295196273427171)

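In our case, the huge p-value (about 0.53, far above the 0.01 level asked for in the question) means that there is no statistically significant difference between the means of our MAE metrics.

Hope this helps - do not hesitate to dig deeper by yourself...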
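To make explicit what `ttest_rel` computes, and where the question's significance level of 0.01 comes in, here is a sketch in plain Python that recomputes the statistic from the per-fold values listed above. The critical value 4.604 is the two-sided 1% point of Student's t distribution with 4 degrees of freedom, taken from standard tables; the variable names `diffs`, `t_stat`, `t_crit` and `significant` are my own.

```python
from math import sqrt
from statistics import mean, stdev

# Per-fold MAE values copied from the results shown above
cv_mae_1 = [3.080392156862745, 2.8262376237623767, 3.164851485148514,
            3.5514851485148515, 3.162376237623762]
cv_mae_2 = [3.1460784313725494, 3.288613861386139, 3.462871287128713,
            3.143069306930693, 3.2490099009900986]

# The paired t-test works on the fold-by-fold differences
diffs = [a - b for a, b in zip(cv_mae_1, cv_mae_2)]
n = len(diffs)

# t = mean difference / standard error of the mean difference
# (stdev is the sample standard deviation, with n - 1 in the denominator)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

# Two-sided critical value for alpha = 0.01 with n - 1 = 4 degrees of freedom
t_crit = 4.604
significant = abs(t_stat) > t_crit
print(round(t_stat, 4), significant)
```

Comparing |t| against the critical value is equivalent to comparing the p-value against 0.01, and the caveat about dependent folds applies here just the same.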


python

statistics

machine-learning

scikit-learn

cross-validation