Custom sklearn transformer changes the shape of X in GridSearchCV
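I want to build a whole pipeline out of custom transformers, but I found that some of the transformers I built work fine when cross-validation is not involved, i.e. pipe.fit(X_train, y_train) works, while GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X_train, y_train) results in an error.
In this example I am using OneHotEncoder, which works fine when used through make_column_transformer or ColumnTransformer, but not when it is wrapped in a custom transformer.
Code: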
BASE_TREE_MODEL = RandomForestRegressor()

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),
            remainder='passthrough')

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        df_temp = pd.DataFrame(self.encoder.fit_transform(X_),
                               columns=self.encoder.get_feature_names_out())
        return df_temp

data_get_dummies_ = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse=False), CATEGORICAL_FEATURES),
    remainder='passthrough')

pipe = Pipeline([
    ('start', data_get_dummies()),
    ('model', BASE_TREE_MODEL)
])

param_grid = dict()
grid_search = GridSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1)
cv_model = grid_search.fit(X_train, y_train)

print('Pipeline:')
print(cv_model.best_estimator_)
print('----------------------')
print('Score:')
print(cv_model.best_score_)
Error:
/Users/simado/opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- onehotencoder__Pastato energijos suvartojimo klase:_E
- onehotencoder__Pastato tipas:_Karkasinis
- onehotencoder__Sildymas:_Geoterminis, kita, centrinis kolektorinis
Feature names seen at fit time, yet now missing:
- onehotencoder__Artimiausi darzeliai_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios mokyklos_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios parduotuves_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios stoteles_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Gatve_Virsilu g.
- ...
I have been struggling with this for several days and still can't figure it out :(
I think the transform method should call only transform on the encoder, not fit_transform.
Please try this:
BASE_TREE_MODEL = RandomForestRegressor()

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),
            remainder='passthrough')

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        # Only transform here; the encoder was already fitted in fit().
        df_temp = pd.DataFrame(self.encoder.transform(X_),
                               columns=self.encoder.get_feature_names_out())
        return df_temp
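The problem is that you are refitting the OneHotEncoder every time transform is called:
df_temp = pd.DataFrame(self.encoder.fit_transform(X_),
                       columns=self.encoder.get_feature_names_out())
So whenever the test or cross-validation data contains categories that were not seen during training (or lacks some that were), the output has a different set of columns than it had during training, and the error is raised. You should not refit the encoder on test data, only transform it:
df_temp = pd.DataFrame(self.encoder.transform(X_),
                       columns=self.encoder.get_feature_names_out())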
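To see the shape difference in isolation, here is a small sketch with a plain OneHotEncoder on made-up data (the "city" column and its values are assumptions for illustration, not taken from the question). The encoder's default sparse output is converted with toarray() so the shapes are easy to compare.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["a", "b", "c", "a"]})
valid = pd.DataFrame({"city": ["a", "d"]})  # "d" was never seen in training

# Correct: fit on the training data, then only transform the validation data.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
ok = enc.transform(valid).toarray()
ok_shape = ok.shape            # (2, 3): columns for a, b, c; "d" is all zeros

# Buggy: fit_transform on the validation data relearns the categories.
refit = OneHotEncoder(handle_unknown="ignore")
refit.fit(train)
bad = refit.fit_transform(valid).toarray()
bad_shape = bad.shape          # (2, 2): columns for a, d only

learned_ok = list(enc.categories_[0])
learned_bad = list(refit.categories_[0])
```

With transform the validation data keeps the three training columns, while fit_transform silently replaces them with the two categories found in the validation data, which is the mismatch the warning in the question reports.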