scikit-learn pipeline is not processed correctly with GridSearchCV
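I am trying to fit a model on a dataset that has both categorical and numerical variables. So I one-hot encode the categorical features and feed them into a pipeline that is used in GridSearchCV. When I try to fit the model, the error is raised on the last line. My understanding is that the preprocessing step of the pipeline is not executed before the model is fitted, because it raises a TypeError on the column names before any encoding happens. What is the correct way to set this up?
Error: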
TypeError: '['First' 'Second' 'Third']' is an invalid key
My code:
y = sample.iloc[:, -1:]
X = sample.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=2, shuffle=True
)
categorical_columns = [
    "first",
    "second",
    "third",
]
numerical_columns = [
    "fourth",
    "thith",
    "sixth",
]
categorical_encoder = preprocessing.OneHotEncoder()
preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, enc_sample[categorical_columns].values.reshape(-1, 3)),
     ('num', 'passthrough', enc_sample[numerical_columns])])
pipe = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', GradientBoostingRegressor())
])
cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)
search_grid = {
    "classifier__n_estimators": [100],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [5],
    "classifier__min_samples_leaf": [8],
    "classifier__subsample": [0.6]
}
search = GridSearchCV(
    estimator=pipe, param_grid=search_grid, scoring='neg_mean_absolute_error',
    cv=cv, n_jobs=-1, return_train_score=True
)
search.fit(X_train, y_train)
For reference, I used the following official documentation:
https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
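Your ColumnTransformer does not actually select the categorical and numerical columns. The third element of each tuple must identify columns (names, indices, a boolean mask, or a callable), not contain the data itself, which is why the array of values is rejected as an invalid key. You can fix this by using sklearn.compose.make_column_selector to select the columns by dtype.
You can use it as follows: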
from sklearn.compose import make_column_selector
preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, make_column_selector(dtype_include=object)),
     ('num', 'passthrough', make_column_selector(dtype_exclude=object))])
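To see the selector-based transformer working end to end, here is a minimal sketch of the whole flow. The DataFrame, its column values, the `target` column, and the small parameter grid are all hypothetical stand-ins for the asker's `sample` data, and the transformer is named `preprocessor` here so that it does not shadow the `sklearn.preprocessing` module the way `preprocessing = ColumnTransformer(...)` does in the question. `handle_unknown="ignore"` is an extra precaution that is not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the asker's `sample` DataFrame.
rng = np.random.default_rng(0)
n = 60
sample = pd.DataFrame({
    # Force object dtype so that dtype_include=object matches these columns.
    "first": pd.Series(rng.choice(["a", "b"], size=n), dtype=object),
    "second": pd.Series(rng.choice(["x", "y", "z"], size=n), dtype=object),
    "fourth": rng.normal(size=n),
    "sixth": rng.normal(size=n),
})
sample["target"] = 2.0 * sample["fourth"] + rng.normal(scale=0.1, size=n)

X = sample.iloc[:, :-1]
y = sample.iloc[:, -1]  # 1-D target, which regressors expect

# Column *selectors* go in the third slot, never the data itself.
preprocessor = ColumnTransformer([
    # handle_unknown="ignore" avoids errors if a CV fold lacks a category.
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=object)),
    ("num", "passthrough", make_column_selector(dtype_exclude=object)),
])

pipe = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", GradientBoostingRegressor(random_state=0)),
])

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)
search_grid = {
    "classifier__n_estimators": [50],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [3, 5],
}
search = GridSearchCV(
    estimator=pipe,
    param_grid=search_grid,
    scoring="neg_mean_absolute_error",
    cv=cv,
)

# The pipeline encodes the columns inside each CV split, then fits the model.
search.fit(X, y)
print(search.best_params_)
```

When the input is a DataFrame, passing the lists of column names directly (for example `('cat', categorical_encoder, categorical_columns)`) should work as well, which may be preferable if dtype alone does not separate the two groups.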