ColumnTransformer(s) in various parts of a pipeline do not play well
I am building a stacked regression model using sklearn and mlxtend.regressor.StackingRegressor.
For example, suppose I want the following small pipeline (a rough code sketch follows the list):
- A stacking regressor with two regressors:
  - A pipeline which:
    - performs data imputation
    - 1-hot encodes categorical features
    - performs linear regression
  - A pipeline which:
    - performs data imputation
    - performs regression with a decision tree
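A possible sketch of this layout (my reconstruction, not code from the original gist; the imputation and encoding details are simplified):

# Hypothetical sketch of the desired layout: each base regressor is a pipeline
# that performs its own imputation before fitting its model.
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from mlxtend.regressor import StackingRegressor

naive_stack = StackingRegressor(
    regressors=[
        make_pipeline(SimpleImputer(strategy='median'), LinearRegression()),
        make_pipeline(SimpleImputer(strategy='median'), DecisionTreeRegressor()),
    ],
    meta_regressor=DecisionTreeRegressor()
)
# naive_stack.fit(X, y) fails as soon as X contains NaN (see below).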
Unfortunately, this is not possible, because StackingRegressor does not accept NaN in its input data. This is the case even if its regressors know how to handle NaN, as they would in my case, where the regressors are actually pipelines that perform data imputation.
This is not a problem, however: I can simply move the data imputation outside of the stacked regressor. My pipeline now looks like this:
- Perform data imputation
- Apply a stacking regressor with two regressors:
  - A pipeline which:
    - 1-hot encodes categorical features
    - standardizes numerical features
    - performs linear regression
  - An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the whole minimal working example, with comments, is in this gist):
sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
         make_pipeline(OneHotEncoder(), StandardScaler()),
         make_column_selector(dtype_include='category')),
        ('numerical',
         StandardScaler(),
         make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

sr_tree = DecisionTreeRegressor()

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
     SimpleImputer(strategy='constant', fill_value='None'),
     make_column_selector(dtype_include='category')),
    ('numerical',
     SimpleImputer(strategy='median'),
     make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
    )),
    ('model', StackingRegressor(
        regressors=[sr_linear, sr_tree],
        meta_regressor=DecisionTreeRegressor(),
        use_features_in_secondary=True
    ))
])
Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix. But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to add the back_to_pandas step to convert the matrix back into a data frame.
(To use get_feature_names_out I had to install a nightly build of sklearn, since the current stable version 1.0.2 does not support it yet. Luckily it can be installed with one simple command.)
The code above fails when calling stacked_regressor.fit(), with the following error (the whole stack trace is, again, in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, since I added the back_to_pandas step to the outer pipeline, the inner pipelines should receive a pandas data frame!
Indeed, if I just fit_transform() my ct_imputation object, I clearly obtain a pandas data frame.
I cannot understand where and when the data being passed along stops being a data frame.
Why does my code fail?
I think the problem has to be attributed to StackingRegressor. Admittedly, I am no expert on its usage and I have not explored its source code, but I found this sklearn issue - #16473 which seems to suggest that << [the regressors and meta_regressors] do not preserve dataframes >> (even though it refers to sklearn's StackingRegressor instances, not mlxtend's).
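A minimal sketch of the suspected mechanism (my assumption, not something stated in the original post): estimators that validate their input with sklearn's check_array / check_X_y helpers convert a DataFrame into a plain numpy array before passing it on, at which point make_column_selector can no longer be applied:

import pandas as pd
from sklearn.utils.validation import check_array

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
arr = check_array(df)   # standard input validation used by many estimators
print(type(arr))        # <class 'numpy.ndarray'>: the DataFrame structure is gone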
Indeed, look at what happens after replacing it with the sr_linear pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor

import numpy as np
import pandas as pd

# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame

# Small data preprocessing:
for column in d.columns:
    if d[column].dtype == object or column == 'MSSubClass':
        d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)

# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]

# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
         make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
         make_column_selector(dtype_include='category')),
        ('numerical',
         StandardScaler(),
         make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
     SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
     make_column_selector(dtype_include='category')),
    ('numerical',
     SimpleImputer(strategy='median'),
     make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
    )),
    ('mdl', sr_linear)
])

stacked_regressor.fit(X_train, y_train)
Notice that I had to slightly modify the 'back_to_pandas' step, because for some reason pd.DataFrame was changing the dtypes of the columns to 'object' only (from 'category' and 'float64'), which then clashed with the imputation performed inside sr_linear. To fix this I applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of the .get_feature_names_out() method of SimpleImputer in the dev version of sklearn):
types = {}
for col in d.columns[:-1]:
    if d[col].dtype == 'category':
        types['categorical__' + col] = str(d[col].dtype)
    else:
        types['numerical__' + col] = str(d[col].dtype)
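For clarity (my addition): the keys follow the <transformer name>__<column name> convention produced by get_feature_names_out, so types ends up looking roughly like this (illustrative entries, the actual dtypes depend on the dataset):

# {'categorical__MSZoning': 'category', 'numerical__LotArea': 'float64', ...}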
The right things to do were:
- Move from mlxtend's StackingRegressor to sklearn's. I believe the former was created back when sklearn did not yet have a stacking regressor; there is no need for the more 'obscure' solution any longer, and sklearn's stacking regressor works very well.
- Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among its features.
The working version of the code is the following:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor

import numpy as np
import pandas as pd

def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        if df[column].dtype == object or 'MSSubClass' in column:
            df[column] = pd.Categorical(df[column])
    return df

d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')

sr_linear = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('model', LinearRegression())
])

ct_preprocessing = ColumnTransformer(transformers=[
    ('categorical',
     make_pipeline(
         SimpleImputer(strategy='constant', fill_value='None'),
         OneHotEncoder(sparse=False, handle_unknown='ignore')
     ),
     make_column_selector(dtype_include='category')),
    ('numerical',
     SimpleImputer(strategy='median'),
     make_column_selector(dtype_include=np.number))
])

stacking_regressor = Pipeline(steps=[
    ('preprocessing', ct_preprocessing),
    ('model', StackingRegressor(
        estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
        final_estimator=DecisionTreeRegressor(),
        passthrough=True
    ))
])

label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
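As a quick sanity check (my addition, not part of the original answer), the fitted pipeline can then be evaluated on the held-out split:

# R^2 score on the test set and a few example predictions
print(stacking_regressor.score(X_test, y_test))
print(stacking_regressor.predict(X_test.head()))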
Thanks to user amiola for putting me on the right track.