Error of first split in cross validation is higher than rest of splits

Question

I am trying to evaluate different regression techniques with 5-fold cross-validation, using the following code:

import numpy as np
from sklearn.linear_model import Ridge, MultiTaskLasso as Lasso, ElasticNet as Elastic
from sklearn.model_selection import KFold

# x_bow (features) and y_np (two target columns: points, price) are defined elsewhere
classifiers = [Ridge, Lasso, Elastic]

kf = KFold(n_splits=5)

for classifier in classifiers:
    name = classifier.__name__

    # kf.split returns a one-shot generator, so call it once per model
    for i, (train_idx, test_idx) in enumerate(kf.split(x_bow)):
        clf = classifier(alpha=1)  # fresh estimator for every fold

        x_train_split = x_bow[train_idx,:]
        y_train_split = y_np[train_idx,:]
        x_test_split = x_bow[test_idx,:]
        y_test_split = y_np[test_idx,:]

        clf.fit(x_train_split, y_train_split)
        prediction = clf.predict(x_test_split)
        # average over samples (axis=0) to get one MAE per target column
        mae = np.mean(np.abs(prediction - y_test_split), axis=0)
        print(f'{name} - split {i+1} - points mae {mae[0]:.2f} price mae {mae[1]:.2f}')

This produces the following results:

Ridge - split 1 - points mae 3.22 price mae 1.71
Ridge - split 2 - points mae 0.47 price mae 0.41
Ridge - split 3 - points mae 0.23 price mae 0.11
Ridge - split 4 - points mae 0.11 price mae 0.20
Ridge - split 5 - points mae 0.36 price mae 0.67
MultiTaskLasso - split 1 - points mae 4.09 price mae 2.37
MultiTaskLasso - split 2 - points mae 0.26 price mae 0.20
MultiTaskLasso - split 3 - points mae 0.48 price mae 0.36
MultiTaskLasso - split 4 - points mae 0.39 price mae 0.28
MultiTaskLasso - split 5 - points mae 0.45 price mae 0.73
ElasticNet - split 1 - points mae 4.09 price mae 2.37
ElasticNet - split 2 - points mae 0.26 price mae 0.20
ElasticNet - split 3 - points mae 0.48 price mae 0.36
ElasticNet - split 4 - points mae 0.39 price mae 0.28
ElasticNet - split 5 - points mae 0.45 price mae 0.73

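Looking at the output, I suspected that the models get a lower error after the first split because they are being evaluated on splits they have already learned from. However, I create a new model inside the for loop, so each fold should get a fresh object (and the first split should not affect the others).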

My question is: why is the error on the first split higher than on the other splits, and how can I fix this?

Answer 1

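The data is not shuffled, and the outliers are at the beginning of the dataset. With the default shuffle=False, KFold cuts the rows into consecutive blocks, so the first test fold is the block that contains the outliers. Passing shuffle=True to KFold spreads them across the folds.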
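
The effect can be reproduced with a minimal sketch on synthetic data. This is not the asker's x_bow and y_np, which are not shown. The data sizes, the 20 outlier rows at the start, and the helper fold_mae are assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 rows, 5 features, 2 targets.
n_samples, n_features = 200, 5
X = rng.normal(size=(n_samples, n_features))
W = rng.normal(size=(n_features, 2))
y = X @ W + rng.normal(scale=0.1, size=(n_samples, 2))

# Assumption: the outliers sit in the first 20 rows, as the answer describes.
y[:20] += rng.normal(scale=20.0, size=(20, 2))


def fold_mae(cv):
    """Return one MAE per fold, averaged over both targets (hypothetical helper)."""
    errors = []
    for train_idx, test_idx in cv.split(X):
        model = Ridge(alpha=1).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        errors.append(float(np.mean(np.abs(pred - y[test_idx]))))
    return errors


# Default KFold: consecutive blocks, so fold 1 is tested on the outlier rows.
unshuffled = fold_mae(KFold(n_splits=5))
# Shuffled KFold: the outlier rows are distributed over all folds.
shuffled = fold_mae(KFold(n_splits=5, shuffle=True, random_state=0))

print("unshuffled:", np.round(unshuffled, 2))
print("shuffled:  ", np.round(shuffled, 2))
```

In the unshuffled run the first fold should stand out, because the model is trained on clean rows and tested on the outliers. Shuffling does not remove the outliers. It only makes the folds comparable, so it may still be worth inspecting or cleaning those rows separately.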

python

regression

scikit-learn

cross-validation