SKLEARN // Combine GridSearchCV with column transform and pipeline
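I am struggling with a machine learning project in which I am trying to combine:
- a sklearn column transform that applies different transformers to my numerical and categorical features
- a pipeline that applies my different transformers and an estimator
- a GridSearchCV to search for the best parameters.
As long as I fill in the parameters of the different transformers manually in my pipeline, the code works perfectly.
But as soon as I try to pass lists of different values to compare in my grid search parameters, I get all kinds of invalid parameter error messages.
Here is my code:
First I divide my features into numerical and categorical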
import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)
Then I create 2 different preprocessing pipelines, one for the numerical and one for the categorical features:
numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),OneHotEncoder(handle_unknown='ignore'))
I combine the two in another pipeline, set my parameters, and run my GridSearchCV code:
model=make_pipeline(preprocessor, LinearRegression() )
params={
'columntransformer__numerical_pipeline__knnimputer__n_neighbors':[1,2,3,4,5,6,7]
}
grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv,scoring='r2')
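I have tried different ways to declare the parameters but never found the right one. I always get an "invalid parameter" error message.
Could you please help me understand what went wrong?
Thanks a lot for your support, and take care!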
I assume that you may have defined preprocessor as follows:
preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
('cat_pipeline', cat_pipeline)])
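Then you need to change your parameter name as follows:
pipeline__numerical_pipeline__knnimputer__n_neighbors
make_pipeline names each step after its lowercased class name, so a preprocessor built with Pipeline is registered as pipeline rather than columntransformer.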
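Rather than guessing the key, you can ask the estimator which names it accepts. A minimal sketch, assuming the Pipeline-based preprocessor guessed at above (the question does not show its real definition); filtering on n_neighbors is only there to keep the output short:

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Assumed definition of preprocessor; the question does not include it.
preprocessor = Pipeline([('numerical_pipeline', numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

# get_params() returns every name that GridSearchCV accepts as a key.
valid_names = sorted(model.get_params().keys())
n_neighbors_keys = [name for name in valid_names if name.endswith('n_neighbors')]
print(n_neighbors_keys)
```

Any key in param_grid that is missing from this listing triggers the "invalid parameter" error.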
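However, the code has a few other problems:
You don't have to call cross_val_score after running GridSearchCV. The output of GridSearchCV itself (its cv_results_ attribute) holds the cross-validation results for every hyperparameter combination.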
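As a rough sketch of that point, the fitted search object already exposes the score of each candidate. The toy data (X_demo, y_demo) is made up purely for illustration and does not come from the question:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical numeric data with a few missing values.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
X_demo[::7, 1] = np.nan

model = make_pipeline(KNNImputer(), LinearRegression())
params = {'knnimputer__n_neighbors': [1, 3, 5]}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=5)
grid.fit(X_demo, y_demo)

# One entry per candidate; no extra cross_val_score call is needed.
for candidate, score in zip(grid.cv_results_['params'],
                            grid.cv_results_['mean_test_score']):
    print(candidate, round(score, 3))
print(grid.best_params_)
```

Wrapping grid in cross_val_score still makes sense if you want nested cross-validation, that is, an estimate of how the whole tuning procedure generalises. It is just not needed to compare the candidates.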
KNNImputer will not work when your data contains string data. You need to apply cat_pipeline before numerical_pipeline.
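The failure is easy to reproduce in isolation. A small sketch with a made-up frame (df_demo and its columns are hypothetical), showing that the imputer rejects the string column but is fine once it only sees numeric columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame: one string column and two numeric columns.
df_demo = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Paris'],
                        'rating': [5.0, np.nan, 4.0, 5.0],
                        'visits': [10.0, 12.0, 30.0, 11.0]})

try:
    KNNImputer(n_neighbors=1).fit_transform(df_demo)
    failed = False
except (ValueError, TypeError) as exc:
    # KNNImputer needs numeric input, so the string column is rejected.
    failed = True
    print(type(exc).__name__, exc)

# Restricting the imputer to the numeric columns works.
numeric_only = df_demo[['rating', 'visits']]
imputed = KNNImputer(n_neighbors=1).fit_transform(numeric_only)
print(imputed)
```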
Complete example:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})
y = [1, 0, 1, 1]

numerical_pipeline = make_pipeline(KNNImputer())
# sparse_output replaced the sparse argument in scikit-learn 1.2;
# on older versions pass sparse=False instead.
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore', sparse_output=False))

# cat_pipeline runs first so that KNNImputer only ever sees numeric data.
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                         ('numerical_pipeline', numerical_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

params = {
    'pipeline__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
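For comparison, here is a hedged sketch of a variant closer to what the question describes, where a ColumnTransformer routes each column type to its own pipeline instead of chaining the two pipelines. This is an alternative under my own assumptions, not the code from the question: the explicit step names and the toy data (X_demo, y_demo) are made up. Naming the steps explicitly is what makes the columntransformer__numerical_pipeline__... key valid; make_column_transformer would generate names such as pipeline-1 instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and two numeric columns, with gaps.
X_demo = pd.DataFrame({
    'city': ['London', 'London', 'Paris', np.nan,
             'Paris', 'London', 'Paris', 'London'],
    'rating': [5.0, 3.0, np.nan, 5.0, 2.0, 4.0, 1.0, 3.0],
    'visits': [10.0, 12.0, 30.0, 11.0, 25.0, 9.0, 28.0, 14.0],
})
y_demo = [1.0, 0.5, 2.0, 1.1, 1.8, 0.9, 2.2, 0.7]

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Explicit step names (assumed here) decide the parameter path.
preprocessor = ColumnTransformer([
    ('numerical_pipeline', numerical_pipeline,
     make_column_selector(dtype_include=np.number)),
    ('cat_pipeline', cat_pipeline,
     make_column_selector(dtype_exclude=np.number)),
])
model = make_pipeline(preprocessor, LinearRegression())

key = 'columntransformer__numerical_pipeline__knnimputer__n_neighbors'
params = {key: [1, 2, 3]}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

With this layout KNNImputer only receives the numeric columns, so the order of the two pipelines no longer matters.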