How are the test scores in cv_results_ and best_score_ calculated in scikit-optimize?
I'm using BayesSearchCV from scikit-optimize to optimize an XGBoost model to fit some data I have. While the model fits fine, I am puzzled by the scores provided in the diagnostic information and am unable to replicate them.
Here's an example script using the Boston housing dataset to illustrate my point:
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
from xgboost.sklearn import XGBRegressor
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import KFold, train_test_split
boston = load_boston()
# Dataset info:
print(boston.keys())
print(boston.data.shape)
print(boston.feature_names)
print(boston.DESCR)
# Put data into dataframe and label column headers:
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
# Add target variable to dataframe
data['PRICE'] = boston.target
# Split into X and y
X, y = data.iloc[:, :-1], data.iloc[:, -1]
# Split into training and validation datasets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
# For cross-validation, split training data into 5 folds
xgb_kfold = KFold(n_splits=5, random_state=42)
# Run fit
xgb_params = {'n_estimators': Integer(10, 3000, 'uniform'),
              'max_depth': Integer(2, 100, 'uniform'),
              'subsample': Real(0.25, 1.0, 'uniform'),
              'learning_rate': Real(0.0001, 0.5, 'uniform'),
              'gamma': Real(0.0001, 1.0, 'uniform'),
              'colsample_bytree': Real(0.0001, 1.0, 'uniform'),
              'colsample_bylevel': Real(0.0001, 1.0, 'uniform'),
              'colsample_bynode': Real(0.0001, 1.0, 'uniform'),
              'min_child_weight': Real(1, 6, 'uniform')}
xgb_fit_params = {'early_stopping_rounds': 15, 'eval_metric': 'mae', 'eval_set': [[X_val, y_val]]}
xgb_pipe = XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=10)
xgb_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_iter=5, n_jobs=1, random_state=42,
                       verbose=4, scoring=None, fit_params=xgb_fit_params)
xgb_cv.fit(X_train, y_train)
After running this, xgb_cv.best_score_ is 0.816 and xgb_cv.best_index_ is 3. Looking at xgb_cv.cv_results_, I wanted to find the test scores for each fold at the best index:
print(xgb_cv.cv_results_['split0_test_score'][xgb_cv.best_index_],
      xgb_cv.cv_results_['split1_test_score'][xgb_cv.best_index_],
      xgb_cv.cv_results_['split2_test_score'][xgb_cv.best_index_],
      xgb_cv.cv_results_['split3_test_score'][xgb_cv.best_index_],
      xgb_cv.cv_results_['split4_test_score'][xgb_cv.best_index_])
which gives:
0.8023562337946979,
0.8337404778903412,
0.861370681263761,
0.8749312273014963,
0.7058815015739375
I'm not sure what is being calculated here, since scoring is set to None in my code. XGBoost's documentation isn't much help, but according to xgb_cv.best_estimator_.score? it should be the R² of the predicted values. In any case, when I manually try to calculate the score for each fold of data used in the fit, I can't reproduce these values:
# First, need to get the actual indices of the data from each fold:
kfold_indexes = {}
kfold_cnt = 0
for train_index, test_index in xgb_kfold.split(X_train):
    kfold_indexes[kfold_cnt] = {'train': train_index, 'test': test_index}
    kfold_cnt = kfold_cnt + 1

# Next, calculate the score for each fold
for p in range(5):
    print(xgb_cv.best_estimator_.score(X_train.iloc[kfold_indexes[p]['test']],
                                       y_train.iloc[kfold_indexes[p]['test']]))
This gives me the following:
0.9954929618573786
0.994844803666101
0.9963108152027245
0.9962274544089832
0.9931314653538819
How does BayesSearchCV calculate the score for each fold, and why can't I replicate the values using the score function? I'd appreciate any help with this question.
(Also, manually taking the mean of these scores gives 0.8156560..., while xgb_cv.best_score_ gives 0.8159277...; not sure why there's a precision difference here.)
best_estimator_ is the refitted estimator, fit on the entire training set after the hyperparameters are selected; scoring it on any part of the training set will therefore be optimistically biased. To reproduce cv_results_, you need to refit the estimator to each training fold and score the corresponding test fold.
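For instance, a minimal sketch of that refitting (assuming the question's xgb_cv, xgb_kfold, X_train, y_train, and xgb_fit_params are still in scope, and an xgboost version whose fit() still accepts the early-stopping keyword arguments):
from sklearn.base import clone

for train_index, test_index in xgb_kfold.split(X_train):
    # Same hyperparameters as the best estimator, but unfitted:
    fold_model = clone(xgb_cv.best_estimator_)
    fold_model.fit(X_train.iloc[train_index], y_train.iloc[train_index], **xgb_fit_params)
    # Score the held-out test fold, as the search does internally:
    print(fold_model.score(X_train.iloc[test_index], y_train.iloc[test_index]))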
Separately, XGBoost's random_state doesn't seem to cover all of the randomness. There is also a seed parameter; setting it produces consistent results for me. (There are some older posts (example) reporting similar issues even with seed set, but perhaps those have been resolved by newer versions of xgb.)
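For example, a sketch of the question's estimator with the extra seed set (the exact value is arbitrary, as long as it is fixed):
xgb_pipe = XGBRegressor(random_state=42, seed=42, objective='reg:squarederror', n_jobs=10)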