GridSearchCV does not handle a scikit-learn pipeline correctly

Question

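I am trying to fit a model on a dataset that has both categorical and numerical variables. I one-hot encode the categorical features and feed them into a pipeline that is used in GridSearchCV. When I try to fit the model, the error is raised on the last line. My understanding is that the preprocessing step of the pipeline is not run before the model is fitted, because I get a TypeError on the column names before any encoding happens. What is the correct way to set this up?

Error: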

TypeError: '['First' 'Second' 'Third']' is an invalid key

My code:

y = sample.iloc[:, -1:]
X = sample.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=2, shuffle=True
)

categorical_columns = [
    "first",
    "second",
    "third"]
numerical_columns = [
    "fourth",
    "thith", 
    "sixth"
]
categorical_encoder = preprocessing.OneHotEncoder()

preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, enc_sample[categorical_columns].values.reshape(-1, 3)),
     ('num', 'passthrough', enc_sample[numerical_columns])])

pipe = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', GradientBoostingRegressor())
])

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)

search_grid = {
    "classifier__n_estimators": [100],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [5],
    "classifier__min_samples_leaf":[8],
    "classifier__subsample":[0.6]
}
search = GridSearchCV(
    estimator=pipe, param_grid=search_grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, return_train_score=True
)
search.fit(X_train, y_train)

For reference, I followed the official documentation here: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

Answer 1

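Your column transformer does not seem to select the categorical and numerical columns. The third element of each ColumnTransformer tuple should say which columns to use (for example column names or a selector), not contain the data itself. You can fix this by using sklearn.compose.make_column_selector to select the columns by dtype.

You can use it as follows: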

from sklearn.compose import ColumnTransformer, make_column_selector

# Select columns by dtype instead of passing the data itself.
preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, make_column_selector(dtype_include=object)),
     ('num', 'passthrough', make_column_selector(dtype_exclude=object))])
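
For a fuller picture, here is a minimal end-to-end sketch of the same idea. It is only an illustration under some assumptions: the DataFrame is synthetic, and the column values, the `target` column name and the reduced parameter grid are made up for the example rather than taken from the question. It passes lists of column names to ColumnTransformer, which should work as an alternative to make_column_selector when the columns are known in advance. The transformer is named `preprocess` so that it does not shadow the `sklearn.preprocessing` module.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the asker's data (hypothetical values).
rng = np.random.default_rng(0)
n_rows = 60
sample = pd.DataFrame({
    "first": rng.choice(["a", "b", "c"], size=n_rows),
    "second": rng.choice(["x", "y"], size=n_rows),
    "fourth": rng.normal(size=n_rows),
    "fifth": rng.normal(size=n_rows),
    "target": rng.normal(size=n_rows),
})
X = sample.drop(columns="target")
y = sample["target"]  # 1-D target

categorical_columns = ["first", "second"]
numerical_columns = ["fourth", "fifth"]

# Pass column NAMES; the transformer pulls the data out of X itself,
# separately for every cross-validation split.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("num", "passthrough", numerical_columns),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("classifier", GradientBoostingRegressor(random_state=0)),
])

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)
search_grid = {
    "classifier__n_estimators": [20],
    "classifier__max_depth": [2, 3],
}
search = GridSearchCV(
    estimator=pipe,
    param_grid=search_grid,
    scoring="neg_mean_absolute_error",
    cv=cv,
)
search.fit(X, y)  # raw DataFrame in; encoding happens inside the pipeline
```

Because the encoder is fitted inside each fold, `handle_unknown="ignore"` is used here so that a category missing from a training split does not raise an error at prediction time.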

python

scikit-learn

boosting

one-hot-encoding