MultiInputOutput Model RandomSearch with Scikit Pipelines

I am trying to compare different regression strategies for a forecasting problem:

The scikit documentation for the multi-input-output wrapper is actually not that good, but it does mention:

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html

set_params(**params)
Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). 
The latter have parameters of the form <component>__<parameter> so that it’s possible to
update each component of a nested object.
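
For a plain Pipeline this naming scheme is easy to verify; below is a minimal sketch (the Ridge regressor is used purely for illustration, it is not part of my actual setup):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = Pipeline([('scaler', StandardScaler()), ('estimator', Ridge())])

# Nested parameters are addressed as <step_name>__<parameter>
pipe.set_params(estimator__alpha=0.5)

# get_params() lists every valid, possibly nested, parameter name
print(sorted(pipe.get_params().keys()))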

So I build my pipeline as follows:

import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)),
                             ('estimator', xgb.XGBRegressor())])

And then create the wrapper as:

from sklearn.multioutput import MultiOutputRegressor

mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)

Following the scikit pipeline documentation, I define the xgboost parameters as:

parameters = [
    {
        'estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        'estimator__max_depth': [10, 100, 1000],
        # etc...
    }
]

Then I run my cross-validation as:

randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0, n_iter=5,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise', 
                                       return_train_score=True,
                                       scoring='neg_mean_absolute_error')

However, I am getting the following error:

ValueError: Invalid parameter reg_alpha for estimator Pipeline(steps=[('scaler', StandardScaler()),
                ('variance_selector', VarianceThreshold(threshold=0.03)),
                ('estimator',
                 XGBRegressor(base_score=None, booster=None,
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, gamma=None, gpu_id=None,
                              importance_type='gain',
                              interaction_constraints=None, learning_rate=None,
                              max_delta_step=None, max_depth=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, n_estimators=100,
                              n_jobs=None, num_parallel_tree=None,
                              random_state=None, reg_alpha=None,
                              reg_lambda=None, scale_pos_weight=None,
                              subsample=None, tree_method=None,
                              validate_parameters=None, verbosity=None))]). Check the list of available parameters with `estimator.get_params().keys()`.
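
Following the hint at the end of the error message, the full list of valid parameter names can be printed from the wrapper itself (a quick check, not part of the original run):

print(sorted(mimo_wrapper.get_params().keys()))
# This should include doubled names such as 'estimator__estimator__reg_alpha'
# and 'estimator__estimator__max_depth', because MultiOutputRegressor stores
# the entire pipeline under its own 'estimator' parameter.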

Did I misunderstand the scikit documentation? I have also tried setting the parameters as estimator__estimator__param, since that might be the way to access the parameters inside mimo_wrapper, but this proved unsuccessful (example below):

parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [10, 100, 1000]
}


random_grid = RandomizedSearchCV(estimator=pipeline_xgboost, param_distributions=parameters,
                                 random_state=0, n_iter=5, n_jobs=-1, refit=True, cv=3,
                                 verbose=True, pre_dispatch='2*n_jobs', error_score='raise',
                                 return_train_score=True, scoring='neg_mean_absolute_error')

hyperparameters_tuning = random_grid.fit(final_file_df_with_aggregates.drop(columns=TARGETS+UMAPS),
                                         final_file_df_with_aggregates[TARGETS])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_11898/2539017483.py in <module>
----> 1 hyperparameters_tuning = random_grid.fit(final_file_df_with_aggregates.drop(columns=TARGETS+UMAPS),
      2                               final_file_df_with_aggregates[TARGETS])

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    889                 return results
    890 
--> 891             self._run_search(evaluate_candidates)
    892 
    893             # multimetric is determined here because in the case of a callable

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1764     def _run_search(self, evaluate_candidates):
   1765         """Search n_iter candidates from param_distributions"""
-> 1766         evaluate_candidates(
   1767             ParameterSampler(
   1768                 self.param_distributions, self.n_iter, random_state=self.random_state

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
    836                     )
    837 
--> 838                 out = parallel(
    839                     delayed(_fit_and_score)(
    840                         clone(base_estimator),

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1054 
   1055             with self._backend.retrieval_context():
-> 1056                 self.retrieve()
   1057             # Make sure that we get a last message telling us we are done
   1058             elapsed_time = time.time() - self._start_time

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    933             try:
    934                 if getattr(self._backend, 'supports_timeout', False):
--> 935                     self._output.extend(job.get(timeout=self.timeout))
    936                 else:
    937                     self._output.extend(job.get())

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    437                 raise CancelledError()
    438             elif self._state == FINISHED:
--> 439                 return self.__get_result()
    440             else:
    441                 raise TimeoutError()

/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

Interestingly enough, I have noticed that setting the estimator parameters outside of the random search function works fine:

parameters = {
    'estimator__max_depth': [10, 100, 1000]
}

mimo_wrapper.estimator.set_params(estimator__max_depth=200)

As you can see, max_depth has now changed:

Pipeline(steps=[('scaler', StandardScaler()),
                ('variance_selector', VarianceThreshold(threshold=0.03)),
                ('estimator',
                 XGBRegressor(base_score=None, booster=None,
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, gamma=None, gpu_id=None,
                              importance_type='gain',
                              interaction_constraints=None, learning_rate=None,
                              max_delta_step=None, max_depth=200,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, n_estimators=100,
                              n_jobs=None, num_parallel_tree=None,
                              random_state=None, reg_alpha=None,
                              reg_lambda=None, scale_pos_weight=None,
                              subsample=None, tree_method=None,
                              validate_parameters=None, verbosity=None))])
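
Presumably the same change can be made through the wrapper itself by doubling the prefix, since the pipeline sits under the wrapper's own estimator parameter (a sketch on my part, not taken from the run above):

# The first 'estimator' selects MultiOutputRegressor's wrapped pipeline,
# the second selects the pipeline step named 'estimator'.
mimo_wrapper.set_params(estimator__estimator__max_depth=200)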

Dear colleagues, it seems this was due to an issue in XGB.Regressor. In any case, the correct way to create the parameters for a pipeline wrapped in a MultiOutputRegressor is:

parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [10, 100, 1000]
}
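
Putting it together, here is a minimal end-to-end sketch; the synthetic data is my own stand-in for the real DataFrame, and the reduced search settings are illustrative only:

import numpy as np
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real data: 100 samples, 10 features, 2 targets
rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.rand(100, 2)

pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)),
                             ('estimator', xgb.XGBRegressor())])
mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)

# Doubled prefix: the first 'estimator' is MultiOutputRegressor's parameter,
# the second is the name of the pipeline step holding the XGBRegressor.
parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [10, 100, 1000]
}

randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0,
                                       n_iter=5, cv=3,
                                       scoring='neg_mean_absolute_error')
hyperparameters_tuning = randomized_search.fit(X, y)
print(randomized_search.best_params_)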