Custom Sklearn Transformer changes the shape of X in the GridSearchCV

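I want to build an entire pipeline out of custom transformers, but I found that some of the transformers I built work fine only when cross-validation is not involved. That is, pipe.fit(X_train, y_train) WORKS, while GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X_train, y_train) raises an ERROR.

In this example I am using OneHotEncoder, which works fine when used with make_column_transformer / ColumnTransformer, but not once it is wrapped in a custom transformer.

Code: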

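import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# CATEGORICAL_FEATURES, X_train and y_train are defined elsewhere
BASE_TREE_MODEL = RandomForestRegressor()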

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns:list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer((OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),remainder='passthrough')
    def fit(self, X, y = None):
        self.encoder.fit(X)
        return self
    def transform(self, X, y = None) -> pd.DataFrame:
        X_ = X.copy()
        df_temp=pd.DataFrame(self.encoder.fit_transform(X_), columns=self.encoder.get_feature_names_out())
        return df_temp

data_get_dummies_ = make_column_transformer((OneHotEncoder(handle_unknown="ignore", sparse=False), CATEGORICAL_FEATURES),remainder='passthrough')


pipe = Pipeline([
                ('start', data_get_dummies()),
                ('model', BASE_TREE_MODEL)
                ])

param_grid = dict()
grid_search = GridSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1)
cv_model = grid_search.fit(X_train, y_train)

print('Pipeline:')
print(cv_model.best_estimator_)
print('----------------------')
print('Score:')
print(cv_model.best_score_)

Error:

/Users/simado/opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- onehotencoder__Pastato energijos suvartojimo klase:_E
- onehotencoder__Pastato tipas:_Karkasinis
- onehotencoder__Sildymas:_Geoterminis, kita, centrinis kolektorinis
Feature names seen at fit time, yet now missing:
- onehotencoder__Artimiausi darzeliai_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios mokyklos_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios parduotuves_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios stoteles_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Gatve_Virsilu g.
- ...

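I have been struggling with this for a couple of days now and just can't figure it out :(

I think the transform method should call only transform, not fit_transform.

Please try this: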

BASE_TREE_MODEL = RandomForestRegressor()

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns:list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer((OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),remainder='passthrough')
    def fit(self, X, y = None):
        self.encoder.fit(X)
        return self
    def transform(self, X, y = None) -> pd.DataFrame:
        X_ = X.copy()
        # transform only: reuse the categories learned in fit instead of refitting
        df_temp=pd.DataFrame(self.encoder.transform(X_), columns=self.encoder.get_feature_names_out())
        return df_temp

The problem is that you are refitting the OneHotEncoder when you call transform:

df_temp=pd.DataFrame(self.encoder.fit_transform(X_), 
                     columns=self.encoder.get_feature_names_out())

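Refitting makes the output columns depend on the categories present in whatever data is passed in. So when a fold in testing/CV contains values of a categorical feature that were not seen in training (or lacks some that were), your output will have different dimensions than in training, and an error is raised. You should not refit your encoder at test time, just transform: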

df_temp=pd.DataFrame(self.encoder.transform(X_),
                     columns=self.encoder.get_feature_names_out())
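
To see the shape mismatch in isolation, here is a small sketch that uses OneHotEncoder directly on two made-up folds (the arrays X_train and X_valid below are toy data, not the asker's dataset). It compares the output shape when the fitted encoder is reused with the shape when a fresh encoder is fitted on the validation fold.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy folds (hypothetical): the validation fold lacks "b" and "c"
# and contains an unseen category "d".
X_train = np.array([["a"], ["b"], ["c"]], dtype=object)
X_valid = np.array([["a"], ["d"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore").fit(X_train)

# Correct: reuse the categories learned on the training fold.
ok_shape = enc.transform(X_valid).shape

# Buggy: fit_transform relearns the categories from the validation fold alone.
refit_shape = OneHotEncoder(handle_unknown="ignore").fit_transform(X_valid).shape

print(ok_shape)     # (2, 3)
print(refit_shape)  # (2, 2)
```

The refitted encoder produces two columns instead of three, which is the kind of mismatch the model step then complains about.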