Is it possible to optimize hyperparameters for optional sklearn pipeline steps?

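I am trying to build a pipeline that contains some optional steps. However, I want to optimize the hyperparameters of those steps as well, because I want to find the best option between not using them at all and using them with different configurations (in my case SelectFromModel - sfm).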

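from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))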

p_grid_lr = {"clf__max_depth": [10, 50, 100, None],
             "clf__n_estimators": [10, 50, 100, 200, 500, 800],
             "clf__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
             "sfm": ['passthrough', sfm],
             "sfm__max_depth": [10, 50, 100, None],
             "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
             "sfm__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
            }

pipeline = Pipeline([
    ('scl', stdscl),
    ('sfm', sfm),
    ('clf', clf),
])

gs_clf = GridSearchCV(
    estimator=pipeline,
    param_grid=p_grid_lr,
    cv=KFold(shuffle=True, n_splits=5, random_state=1),
    scoring='r2',
    n_jobs=-1,
)
gs_clf.fit(X_train, y_train)

clf = gs_clf.best_estimator_

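The error I get is 'string' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together, in my case only 'passthrough' on its own and sfm with its different hyperparameters?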

Thanks!

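Following this example, you can build a list of dictionaries: one that contains sfm and its related parameters, and another that skips the step by setting it to "passthrough".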

As @Robin pointed out, you can define p_grid_lr as a list of dictionaries. Indeed, the docs of GridSearchCV say the following about this:

param_grid: dict or list of dictionaries

Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

p_grid_lr = [
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        "sfm__estimator__max_depth": [10, 50, 100, None],
        "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
        "sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
    },
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        "sfm": ['passthrough'],
    }
]
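To check that a list of dictionaries behaves as described, here is a minimal sketch. It assumes synthetic data from make_regression and deliberately tiny grids (small_grid is a made-up name, as are the values in it), so it only illustrates the mechanism and is not the grid from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only (not the asker's data).
X_train, y_train = make_regression(
    n_samples=60, n_features=8, n_informative=3, random_state=1
)

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Deliberately small grids so the sketch runs quickly.
small_grid = [
    {
        # sfm is active: tune the forest inside the selector.
        "clf__n_estimators": [10, 20],
        "sfm__estimator__n_estimators": [10, 20],
    },
    {
        # sfm is skipped: no sfm__* keys appear in this dictionary.
        "clf__n_estimators": [10, 20],
        "sfm": ["passthrough"],
    },
]

gs = GridSearchCV(
    pipeline,
    small_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    scoring="r2",
)
gs.fit(X_train, y_train)

# Each dictionary spans its own grid: 2 * 2 + 2 = 6 candidates in total.
n_candidates = len(gs.cv_results_["params"])
print(n_candidates, gs.best_params_)
```

Because the sfm__estimator__* keys only appear in the dictionary where sfm is a real estimator, the search never tries to call set_params on the 'passthrough' string.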

A less scalable alternative (for your case) might be the following:

p_grid_lr_ = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
    "sfm": ['passthrough', 
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
            ...]
}

which spells out every possible combination of your parameters.
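If you do go with this alternative, the candidates do not have to be typed by hand. The following is a rough sketch that generates them with ParameterGrid; the names sfm_inner_grid and sfm_candidates are made up for illustration, and the values are the ones from the question.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import ParameterGrid

# Parameters of the forest wrapped by SelectFromModel (values from the question).
sfm_inner_grid = ParameterGrid({
    "max_depth": [10, 50, 100, None],
    "n_estimators": [10, 50, 100, 200, 500, 800],
    "max_features": [0.1, 0.5, 1.0, "sqrt", "log2"],
})

# 'passthrough' plus one pre-configured selector per parameter combination.
sfm_candidates = ["passthrough"] + [
    SelectFromModel(RandomForestRegressor(random_state=1, **params))
    for params in sfm_inner_grid
]

# 1 + 4 * 6 * 5 = 121 values for the "sfm" key.
print(len(sfm_candidates))
```

The resulting list could then be used as the value of "sfm" in a single-dictionary grid. The number of candidates grows multiplicatively, which is why the list of dictionaries is usually the more practical choice.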

Moreover, note that to access the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator inside SelectFromModel, you should write the parameter keys as

"sfm__estimator__max_depth": [10, 50, 100, None],
"sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2']

rather than

"sfm__max_depth": [10, 50, 100, None],
"sfm__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__max_features": [0.1, 0.5, 1.0,'sqrt','log2']

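because these parameters belong to the wrapped estimator itself (max_features could in principle also be a parameter of SelectFromModel, but in that case it might only take integer values).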
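The difference between the two key styles can be seen directly with set_params. This is a small sketch that assumes the same three-step pipeline as in the question; the flag bad_key_rejected is only there for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Reaches the forest wrapped by SelectFromModel through its `estimator` attribute.
pipeline.set_params(sfm__estimator__max_depth=10)
inner_depth = pipeline.named_steps["sfm"].estimator.max_depth

# SelectFromModel itself has no max_depth parameter, so this key is rejected.
bad_key_rejected = False
try:
    pipeline.set_params(sfm__max_depth=10)
except ValueError:
    bad_key_rejected = True

print(inner_depth, bad_key_rejected)
```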

In general, you can list all the parameters that can be optimized via pipeline.get_params().keys() (or, more generally, estimator.get_params().keys()).
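As a quick illustration, again assuming the pipeline from the question, the keys that start with sfm can be listed like this (sfm_keys is just an illustrative name):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Every key returned here is a valid key for param_grid.
sfm_keys = sorted(k for k in pipeline.get_params().keys() if k.startswith("sfm"))
for key in sfm_keys:
    print(key)
```

The listing contains both sfm__max_features (the selector's own parameter) and sfm__estimator__max_features (the inner forest's parameter), but no sfm__max_depth.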

Finally, the user guide for Pipelines is a great read.