Sklearn pass fit() parameters to xgboost in pipeline
Similar to passing a parameter to only one part of a pipeline object in scikit-learn, I want to pass fit() parameters to just one step of the pipeline. Usually it should work fine, like:
estimator = XGBClassifier()
pipeline = Pipeline([
    ('clf', estimator)
])
and call it like this:
pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)
But it fails:
/usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
114 """
115 Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
--> 116 self.steps[-1][-1].fit(Xt, yt, **fit_params)
117 return self
118
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose)
443 early_stopping_rounds=early_stopping_rounds,
444 evals_result=evals_result, obj=obj, feval=feval,
--> 445 verbose_eval=verbose)
446
447 self.objective = xgb_options["objective"]
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, learning_rates, xgb_model, callbacks)
201 evals=evals,
202 obj=obj, feval=feval,
--> 203 xgb_model=xgb_model, callbacks=callbacks)
204
205
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
97 end_iteration=num_boost_round,
98 rank=rank,
---> 99 evaluation_result_list=evaluation_result_list))
100 except EarlyStopException:
101 break
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/callback.py in callback(env)
196 def callback(env):
197 """internal function"""
--> 198 score = env.evaluation_result_list[-1][1]
199 if len(state) == 0:
200 init(env)
IndexError: list index out of range
whereas
estimator.fit(X_train, y_train, early_stopping_rounds=20)
works fine.
Here is the solution: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13755/xgboost-early-stopping-and-other-issues. Both early_stopping_rounds and the watchlist/eval_set need to be passed. Without an eval_set, the early-stopping callback reads from an empty evaluation_result_list, which is exactly the IndexError above. Unfortunately, this doesn't work for me, because the variables in the watchlist require a preprocessing step that is only applied inside the pipeline, i.e. I would have to apply this step manually.
For early stopping rounds, you must always specify the validation set given by the eval_set parameter. Here is how the error in your code can be fixed:
pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20, clf__eval_set=[(test_X, test_y)])
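For completeness, here is a minimal end-to-end sketch of this fix; the synthetic data and the train/validation split are my additions, not part of the original question:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Synthetic data standing in for the question's X_train / y_train
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([('clf', XGBClassifier())])

# With no preprocessing steps in the pipeline, the raw held-out split
# can be used as the eval_set directly
pipeline.fit(X_train, y_train,
             clf__early_stopping_rounds=20,
             clf__eval_set=[(X_val, y_val)])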
I recently used the following steps to use the eval metric and eval_set parameters with XGBoost.
1. Create the pipeline with the pre-processing/feature transformation steps. This is made from a previously defined pipeline which includes the xgboost model as the last step:
pipeline_temp = pipeline.Pipeline(pipeline.cost_pipe.steps[:-1])
2. Fit this pipeline:
X_trans = pipeline_temp.fit_transform(X_train[FEATURES], y_train)
3. Create your eval_set by applying the transformations to the test set:
eval_set = [(X_trans, y_train), (pipeline_temp.transform(X_test), y_test)]
4. Add your xgboost step back into the pipeline:
pipeline_temp.steps.append(pipeline.cost_pipe.steps[-1])
5. Fit the new pipeline, passing the parameters:
pipeline_temp.fit(X_train[FEATURES], y_train,
                  xgboost_model__eval_metric=ERROR_METRIC,
                  xgboost_model__eval_set=eval_set)
6. Persist the pipeline if you wish:
joblib.dump(pipeline_temp, save_path)
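Putting the six steps together, here is a self-contained sketch. cost_pipe, FEATURES, and ERROR_METRIC above are the answerer's own objects, so the data and pipeline below are hypothetical stand-ins:
import joblib
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in for cost_pipe: preprocessing with XGBoost as the last step
cost_pipe = Pipeline([('scaler', StandardScaler()),
                      ('xgboost_model', XGBRegressor(n_estimators=500))])

# Steps 1-2: pipeline without the model, fitted on the training data
pipeline_temp = Pipeline(cost_pipe.steps[:-1])
X_trans = pipeline_temp.fit_transform(X_train, y_train)

# Step 3: eval_set built from already-transformed data
eval_set = [(X_trans, y_train), (pipeline_temp.transform(X_test), y_test)]

# Steps 4-5: re-attach the model and fit with the eval parameters
pipeline_temp.steps.append(cost_pipe.steps[-1])
pipeline_temp.fit(X_train, y_train,
                  xgboost_model__eval_metric='rmse',
                  xgboost_model__eval_set=eval_set,
                  xgboost_model__early_stopping_rounds=20)

# Step 6: persist the fitted pipeline
joblib.dump(pipeline_temp, 'cost_pipe.joblib')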
Here is a solution that works in a pipeline with GridSearchCV:
Override the XGBRegressor or XGBClassifier .fit() function
- This step uses train_test_split() to select the specified number of validation records from X for the eval_set, and then passes the remaining records on to fit().
- A new parameter, eval_test_size, is added to .fit() to control the number of validation records. (See the train_test_split test_size documentation.)
- **kwargs passes on any other parameters the user adds for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        X_train, y_train = X, y  # used as-is when no eval split is requested
        if eval_test_size is not None:
            params = super(XGBRegressor, self).get_xgb_params()
            # Hold out eval_test_size records (or fraction) for validation
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            eval_set = [(X_test, y_test)]
            # Could add (X_train, y_train) to eval_set
            # to get .evals_result() for both train and test:
            # eval_set = [(X_train, y_train), (X_test, y_test)]
            kwargs['eval_set'] = eval_set
        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs)
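Before wiring it into a pipeline, a quick standalone check on synthetic data; this example is mine, and it assumes an xgboost version where eval_metric and early_stopping_rounds are accepted by fit(), as in the calls in this answer:
import numpy as np

rng = np.random.RandomState(7)
X_demo = rng.rand(500, 10)
y_demo = X_demo @ rng.rand(10)

model = XGBRegressor_ES(n_estimators=1000, random_state=7)
# Holds out 20% internally for the eval_set, trains on the rest
model.fit(X_demo, y_demo,
          eval_test_size=0.2,
          eval_metric='mae',
          early_stopping_rounds=10)
print(model.best_iteration)  # round where validation MAE stopped improving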
Example usage
Below is a multi-step pipeline that includes multiple transformations of X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:
- X_train contains text documents passed into the pipeline.
- XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a percentage, e.g. xgbr__eval_test_size=0.2.)
- The remaining records in X_train are passed on to XGBRegressor.fit() for the actual fit.
- Early stopping can now occur for each CV fold in the grid search after 75 rounds of boosting without improvement.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
xgbr_pipe = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('vt', VarianceThreshold()),
    ('scaler', StandardScaler(with_mean=False)),  # with_mean=False keeps the tf-idf matrix sparse
    ('Sp', SelectPercentile(score_func=f_regression)),  # f_regression for a continuous target
    ('xgbr', XGBRegressor_ES(n_estimators=2000,
                             objective='reg:squarederror',
                             eval_metric='mae',
                             learning_rate=0.0001,
                             random_state=7))
])
X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values
Example pipeline fit:
%time xgbr_pipe.fit(X_train, y_train,
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae',
                    xgbr__early_stopping_rounds=75)
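After fitting, the stopped round and the per-round metric values can be read from the final step; best_iteration and evals_result() are standard attributes of the xgboost sklearn API when early stopping is used:
fitted_xgbr = xgbr_pipe.named_steps['xgbr']
print(fitted_xgbr.best_iteration)   # boosting round where validation MAE stopped improving
print(fitted_xgbr.evals_result())   # per-round 'mae' values for the internal eval_set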
Example fit with GridSearchCV:
from sklearn.model_selection import GridSearchCV

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train,
                              xgbr__eval_test_size=200,
                              xgbr__eval_metric='mae',
                              xgbr__early_stopping_rounds=75)
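The usual GridSearchCV attributes then apply, for example:
print(grid_result.best_params_)
print(grid_result.best_score_)
best_pipe = grid_result.best_estimator_  # refitted best pipeline, ready for .predict()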