Is it possible to optimize hyperparameters for optional sklearn pipeline steps?
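I am trying to build a pipeline that contains some optional steps. I also want to optimize the hyperparameters of those steps, because I want to find the best choice between not using them at all and using them in different configurations (in my case SelectFromModel, sfm).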
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))

# This single-dict grid is the one that triggers the error described below.
p_grid_lr = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0, "sqrt", "log2"],
    "sfm": ["passthrough", sfm],
    "sfm__max_depth": [10, 50, 100, None],
    "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__max_features": [0.1, 0.5, 1.0, "sqrt", "log2"],
}

pipeline = Pipeline([
    ("scl", stdscl),
    ("sfm", sfm),
    ("clf", clf),
])

gs_clf = GridSearchCV(
    estimator=pipeline,
    param_grid=p_grid_lr,
    cv=KFold(shuffle=True, n_splits=5, random_state=1),
    scoring="r2",
    n_jobs=-1,
)
gs_clf.fit(X_train, y_train)
clf = gs_clf.best_estimator_
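The error I get is: 'str' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together, in my case only 'passthrough' on its own and sfm with different hyperparameters?
Thanks!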
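Referring to this example, you can make a list of dictionaries: one containing sfm and its related parameters, and another that skips the step by using "passthrough".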
As @Robin pointed out, you can define p_grid_lr as a list of dictionaries. Indeed, the docs of GridSearchCV state on this point:
param_grid: dict or list of dictionaries
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
p_grid_lr = [
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm__estimator__max_depth": [10, 50, 100, None],
        "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
        "sfm__estimator__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    },
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm": ['passthrough'],
    }
]
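To see the list-of-dictionaries grid working end to end, here is a minimal sketch. It is not the asker's setup: the data comes from make_regression as a stand-in for X_train and y_train, the value lists are cut down so it runs in seconds, the search uses 3 folds, and the name small_grid is made up for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the asker's data).
X_train, y_train = make_regression(
    n_samples=80, n_features=8, n_informative=4, noise=0.1, random_state=1
)

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Deliberately tiny value lists so the sketch finishes quickly.
small_grid = [
    {
        # Feature selection active: tune the forest wrapped by SelectFromModel.
        "clf__n_estimators": [10, 20],
        "sfm__estimator__n_estimators": [5, 10],
        "sfm__estimator__max_depth": [3, None],
    },
    {
        # Feature selection skipped: no sfm__... keys appear in this dictionary.
        "clf__n_estimators": [10, 20],
        "sfm": ["passthrough"],
    },
]

gs_clf = GridSearchCV(
    estimator=pipeline,
    param_grid=small_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    scoring="r2",
)
gs_clf.fit(X_train, y_train)

# The two dictionaries are expanded separately: 2*2*2 + 2 candidates.
n_candidates = len(gs_clf.cv_results_["params"])
print(n_candidates)
print(gs_clf.best_params_)
```

Because each dictionary is expanded on its own, "passthrough" is never combined with any sfm__estimator__ key, which is what avoids the set_params error. With the full value lists from the answer the same logic gives 4*6*5*4*6*5 + 4*6*5 = 14520 candidates, so the full search is expensive.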
A less scalable alternative (for your case) might be the following:
p_grid_lr_ = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    "sfm": ['passthrough',
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
            ...]
}
where every possible combination of your parameters is spelled out explicitly.
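If you go this way, the candidates do not have to be typed by hand. As a sketch only, the list could be generated with itertools.product. The names sfm_options and p_grid_generated are made up for this illustration, and the value lists are copied from the question.

```python
from itertools import product

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

max_depths = [10, 50, 100, None]
n_estimators_values = [10, 50, 100, 200, 500, 800]
max_features_values = [0.1, 0.5, 1.0, "sqrt", "log2"]

# One unfitted SelectFromModel per combination, plus the option to skip the step.
sfm_options = ["passthrough"] + [
    SelectFromModel(
        RandomForestRegressor(
            random_state=1,
            max_depth=depth,
            n_estimators=n_trees,
            max_features=feat,
        )
    )
    for depth, n_trees, feat in product(
        max_depths, n_estimators_values, max_features_values
    )
]

# Hypothetical grid built from the generated list; "sfm" is the only sfm key.
p_grid_generated = {
    "clf__max_depth": max_depths,
    "clf__n_estimators": n_estimators_values,
    "clf__max_features": max_features_values,
    "sfm": sfm_options,
}

print(len(sfm_options))  # 1 + 4 * 6 * 5 = 121
```

The "sfm" list grows multiplicatively with every extra hyperparameter, which is why this variant scales worse than the list of dictionaries.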
Moreover, note that to access the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator inside SelectFromModel, you should write the parameter keys as
"sfm__estimator__max_depth": [10, 50, 100, None],
"sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
rather than
"sfm__max_depth": [10, 50, 100, None],
"sfm__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
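because these parameters belong to the wrapped estimator itself (max_features could in principle also be a parameter of SelectFromModel, but in that case it would likely accept only integer values).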
In general, you can list every parameter that can be tuned via pipeline.get_params().keys() (or, for any estimator, estimator.get_params().keys()).
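As a quick check of the naming rule, the sketch below rebuilds the pipeline from the question and inspects its parameter names. The variable names param_names and sfm_forest_params are only for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

param_names = sorted(pipeline.get_params().keys())

# Parameters of the forest wrapped by SelectFromModel carry the "estimator" segment.
sfm_forest_params = [n for n in param_names if n.startswith("sfm__estimator__")]
print(sfm_forest_params)

print("sfm__estimator__max_depth" in param_names)  # True
print("sfm__max_depth" in param_names)             # False
print("sfm__max_features" in param_names)          # True: SelectFromModel's own parameter
```

Any key used in param_grid has to appear in this list, otherwise set_params rejects it.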
Finally, the user guide for Pipelines is a nice read.