Performance metric when using an XGBoost regressor with sklearn learning_curve

Question

I created an XGBoost regression model and want to see how the training and test performance change as the size of the training set increases.

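import numpy as np
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

# ori_X and y are my feature matrix and target
xgbm_reg = XGBRegressor()
tr_sizes, tr_scs, test_scs = learning_curve(estimator=xgbm_reg,
                                           X=ori_X, y=y,
                                           train_sizes=np.linspace(0.1, 1, 5),
                                           cv=5)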

Which performance metric do tr_scs and test_scs contain?

The sklearn documentation says:

scoring : str or callable, default=None

    A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y)

So I looked at the XGBoost documentation, which says the default objective is reg:squarederror. Does that mean the values in tr_scs and test_scs are squared errors?

I wanted to check this with cross_val_score:

from sklearn.model_selection import cross_val_score

scoring = "neg_mean_squared_error"
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5, scoring=scoring)

But I am not sure how to get the squared error from cross_val_score.

Answer 1

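When scoring is left as None, learning_curve and cross_val_score call the estimator's own score method. For XGBRegressor that built-in scorer is R-squared, so tr_scs and test_scs contain R-squared values. The reg:squarederror objective is only the loss minimized during training and does not determine the scorer. See the code below.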

from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve, cross_val_score, KFold
from sklearn.metrics import r2_score

# generate the data
X, y = make_regression(n_features=10, random_state=100)

# generate 5 CV splits
kf = KFold(n_splits=5, shuffle=False)

# calculate the CV scores using `learning_curve`, use 100% train size for comparison purposes
_, _, lc_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y, train_sizes=[1.0], cv=kf)
print(lc_scores)
# [[0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]]

# calculate the CV scores using `cross_val_score`
cv_scores = cross_val_score(estimator=XGBRegressor(), X=X, y=y, cv=kf)
print(cv_scores)
# [0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]

# calculate the CV scores manually
xgb_scores = []
r2_scores = []

# iterate across the CV splits
for train_index, test_index in kf.split(X):

    # extract the training and test data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # fit the model to the training data
    estimator = XGBRegressor()
    estimator.fit(X_train, y_train)

    # score the test data using the XGBRegressor built-in scorer
    xgb_scores.append(estimator.score(X_test, y_test))

    # score the test data using the R-squared
    y_pred = estimator.predict(X_test)
    r2_scores.append(r2_score(y_test, y_pred))

print(xgb_scores)
# [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]

print(r2_scores)
# [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]

Tags: python, scikit-learn, xgboost