Different results between cross_validate() and my own cross-validation function

Question

After using cross_validate to evaluate the performance of my regression model, I got some results based on the 'r2' scoring.

This is what my code does:

scores = cross_validate(RandomForestRegressor(),X,y,cv=5,scoring='r2')

and this is what I get:

>>> scores['test_score']

array([0.47146303, 0.47492019, 0.49350646, 0.56479323, 0.56897343])

To have more flexibility, I also wrote my own cross-validation function, as follows:

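import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def my_cross_val(estimator, X, y):
    r2_scores = []
    # 5 folds by default; shuffle=True randomizes the row order before splitting
    kf = KFold(shuffle=True)
    for train_index, test_index in kf.split(X, y):
        # Fit on the training rows of this fold
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        # Score on the held-out rows of this fold
        r2 = r2_score(y.iloc[test_index].values, preds)
        r2_scores.append(r2)
    return np.array(r2_scores)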

Now, running

scores = my_cross_val(RandomForestRegressor(),X,y)

I get

array([0.6975932 , 0.68211856, 0.62892119, 0.64776752, 0.66046326])

Am I doing something wrong in my_cross_val(), given that these values seem overestimated compared to cross_validate()? Could it be because I put shuffle=True inside KFold?

Answer 1

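With cv=5 and a regressor, cross_validate uses a plain KFold without shuffling, whereas your function shuffles. To make sure you are comparing like with like, and given that shuffling can make a huge difference in cases like this, here is what you should do:

First, shuffle the data manually: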

from sklearn.utils import shuffle
X_s, y_s = shuffle(X, y, random_state=42)
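As a small sanity check of this step, the sketch below suggests that shuffle applies one and the same permutation to every array it is given, so features and targets stay paired. The toy arrays X_demo and y_demo are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical toy data: each target is 10x its feature value,
# so any misalignment after shuffling would be easy to spot.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 10

X_demo_s, y_demo_s = shuffle(X_demo, y_demo, random_state=42)

# The rows come back in a permuted order...
print(X_demo_s.ravel())
# ...but every feature row is still paired with its own target.
print(np.array_equal(y_demo_s, X_demo_s.ravel() * 10))
```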

Then, run cross_validate on the shuffled data:

scores = cross_validate(RandomForestRegressor(),X_s, y_s, cv=5, scoring='r2')

Change your function to use

kf = KFold(shuffle=False) # no more shuffling (although it should not hurt)

and run it with the already-shuffled data:

scores = my_cross_val(RandomForestRegressor(), X_s, y_s)
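To see why shuffling can matter here, the sketch below inspects what an integer cv resolves to for a regressor and compares the resulting test folds with those of a shuffled KFold. The array y_demo is a hypothetical stand-in used only for its length; whether the original data is ordered in some way (by time, by target value, etc.) is not stated in the question, so treat that as an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, check_cv

# Hypothetical targets; only their number matters for the split
y_demo = np.arange(10)

# What cv=5 is turned into for a regressor (classifier=False)
cv_default = check_cv(5, y_demo, classifier=False)
print(type(cv_default).__name__, cv_default.shuffle)

# Without shuffling, each test fold is a contiguous block of rows
unshuffled_tests = [test.tolist() for _, test in KFold(n_splits=5).split(y_demo)]
print(unshuffled_tests)

# With shuffling, each test fold is drawn from all over the data
shuffled_kf = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_tests = [test.tolist() for _, test in shuffled_kf.split(y_demo)]
print(shuffled_tests)
```

If the rows are sorted in any meaningful way, the contiguous test blocks force the model to predict on a region it never saw during training, which would tend to lower the unshuffled scores.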

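Now the results should be similar, but not yet identical. You can make them identical by defining kf = KFold(n_splits=5, shuffle=False) beforehand (outside the function), using that same kf inside the function, and running cross_validate as

scores = cross_validate(RandomForestRegressor(), X_s, y_s, cv=kf, scoring='r2') # cv=kf

i.e. using the exact same CV partition in both cases. Note that random_state only has an effect in KFold when shuffle=True; with shuffle=False the folds are already deterministic, and recent scikit-learn versions raise an error if random_state is set without shuffling. RandomForestRegressor is itself randomized, so for exactly equal scores you should also pass the same random_state to the estimator in both runs.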
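Putting the pieces together, here is a minimal sketch of the comparison under stated assumptions: synthetic data from make_regression stands in for the original X and y, the function takes the splitter as an extra cv argument (a small change to the original signature), and make_model is a hypothetical helper that fixes the forest's seed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the original data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# One splitter, shared by both approaches
kf = KFold(n_splits=5, shuffle=False)

def make_model():
    # Hypothetical helper: a fixed seed so both runs grow the same trees
    return RandomForestRegressor(n_estimators=20, random_state=0)

def my_cross_val(estimator, X, y, cv):
    r2_scores = []
    for train_index, test_index in cv.split(X, y):
        estimator.fit(X[train_index], y[train_index])
        preds = estimator.predict(X[test_index])
        r2_scores.append(r2_score(y[test_index], preds))
    return np.array(r2_scores)

library_scores = cross_validate(make_model(), X_demo, y_demo,
                                cv=kf, scoring='r2')['test_score']
manual_scores = my_cross_val(make_model(), X_demo, y_demo, kf)

print(library_scores)
print(manual_scores)
print(np.allclose(library_scores, manual_scores))
```

With the same folds and the same estimator seed, the two score arrays are expected to agree; dropping random_state from the forest should leave them close but no longer equal.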

