Performance metric when using XGBoost regressor with sklearn learning_curve
I created an XGBoost regression model and want to see how the training and test performance change as the training set size increases.
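import numpy as np
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

xgbm_reg = XGBRegressor()
tr_sizes, tr_scs, test_scs = learning_curve(estimator=xgbm_reg,
                                            X=ori_X, y=y,
                                            train_sizes=np.linspace(0.1, 1, 5),
                                            cv=5)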
Which performance metric do tr_scs and test_scs report?
The sklearn documentation says:
scoring : str or callable, default=None
A str (see model evaluation documentation) or a scorer callable object / function
with signature scorer(estimator, X, y)
So I looked at the XGBoost documentation, which says the default objective is reg:squarederror.
Does that mean the values in tr_scs and test_scs are squared errors?
I wanted to check this with cross_val_score:
scoring = "neg_mean_squared_error"
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5, scoring=scoring)
but I am not sure how to get the squared error from cross_val_score.
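The built-in scorer of XGBRegressor (its score method) is R-squared. When scoring is left at its default of None, both learning_curve and cross_val_score fall back to this built-in scorer, so tr_scs and test_scs are R-squared values. The objective (reg:squarederror) only sets the loss minimized during training; it does not determine the reported score. See the code below.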
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve, cross_val_score, KFold
from sklearn.metrics import r2_score
# generate the data
X, y = make_regression(n_features=10, random_state=100)
# generate 5 CV splits
kf = KFold(n_splits=5, shuffle=False)
# calculate the CV scores using `learning_curve`, use 100% train size for comparison purposes
_, _, lc_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y, train_sizes=[1.0], cv=kf)
print(lc_scores)
# [[0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]]
# calculate the CV scores using `cross_val_score`
cv_scores = cross_val_score(estimator=XGBRegressor(), X=X, y=y, cv=kf)
print(cv_scores)
# [0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]
# calculate the CV scores manually
xgb_scores = []
r2_scores = []
# iterate across the CV splits
for train_index, test_index in kf.split(X):
    # extract the training and test data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit the model to the training data
    estimator = XGBRegressor()
    estimator.fit(X_train, y_train)
    # score the test data using the XGBRegressor built-in scorer
    xgb_scores.append(estimator.score(X_test, y_test))
    # score the test data using the R-squared
    y_pred = estimator.predict(X_test)
    r2_scores.append(r2_score(y_test, y_pred))
print(xgb_scores)
# [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
print(r2_scores)
# [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]