ColumnTransformer(s) in various parts of a pipeline do not play well

I am building a stacked regression model using sklearn together with mlxtend.regressor.StackingRegressor. For example, say I want the following small pipeline:

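  1. A stacking regressor with two regressors:
    • A pipeline which:
      • performs data imputation
      • one-hot encodes the categorical features
      • performs linear regression
    • A pipeline which:
      • performs data imputation
      • performs regression using a decision tree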

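Unfortunately, this is not possible, because StackingRegressor does not accept NaN in its input data. This holds even when its regressors know how to handle NaN, as in my case, where the regressors are actually pipelines that perform data imputation.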
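The rejection of NaN can be illustrated with scikit-learn's own input validation helper. This is only a sketch: it assumes that mlxtend's StackingRegressor validates its input in a similar way, which the text above suggests but which is not verified here. The array `X` is made-up data.

```python
import numpy as np
from sklearn.utils import check_array

# Hypothetical input with one missing value.
X = np.array([[1.0, np.nan], [3.0, 4.0]])

# check_array rejects non-finite values by default. An estimator that
# runs a check like this on its input fails before any inner pipeline
# gets a chance to impute the missing values.
try:
    check_array(X)
    nan_rejected = False
except ValueError as err:
    nan_rejected = True
    print(err)
```

If the outer estimator validates first, imputation placed inside its regressors comes too late, which is why the imputation has to move in front of it.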

However, this is not a problem: I can just move the data imputation outside the stacked regressor. Now my pipeline looks like this:

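  1. Perform data imputation
  2. Apply a stacking regressor with two regressors:
    • A pipeline which:
      • one-hot encodes the categorical features
      • standardizes the numerical features
      • performs linear regression
    • A sklearn.tree.DecisionTreeRegressor.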

One might try to implement it as follows (the entire minimal working example is in this gist, with comments):

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
             make_pipeline(OneHotEncoder(), StandardScaler()),
             make_column_selector(dtype_include='category')),
        ('numerical',
             StandardScaler(),
             make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

sr_tree = DecisionTreeRegressor()

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
    )),
    ('model', StackingRegressor(
        regressors=[sr_linear, sr_tree],
        meta_regressor=DecisionTreeRegressor(),
        use_features_in_secondary=True
    ))
])

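Note that the "outer" ColumnTransformer (in stacked_regressor) returns a NumPy array. But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the array back to a data frame using the back_to_pandas step. (To use get_feature_names_out I had to use the nightly build of sklearn, because the current stable version, 1.0.2, does not support it yet. Fortunately, it can be installed with one simple command.)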
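The behaviour of the outer ColumnTransformer can be checked on a toy frame. The sketch below uses made-up data (the `toy` frame and its column names are hypothetical) with the same imputation setup as above, and relies on scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical data): one categorical and one numerical column.
toy = pd.DataFrame({
    'colour': pd.Categorical(['red', np.nan, 'blue']),
    'size': [1.0, np.nan, 3.0],
})

imputer = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number)),
])

# The result is a plain ndarray: column names and per-column dtypes are gone.
imputed = imputer.fit_transform(toy)
print(type(imputed), imputed.dtype)
```

Because strings and floats end up side by side in one array, the array has dtype object, which matters once the data is turned back into a data frame.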

The stacked_regressor code fails when calling stacked_regressor.fit(), with the following error (the entire stack trace is, again, in the gist):

ValueError: make_column_selector can only be applied to pandas dataframes

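However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame! In fact, if I run only the imputation and back_to_pandas steps with fit_transform(), I clearly obtain a pandas data frame. I cannot understand where and when exactly the data being passed around ceases to be a data frame. Why is my code failing?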

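I think the problem has to be ascribed to StackingRegressor. Admittedly, I am not an expert on its usage and I have not explored its source code yet, but I found this sklearn issue, #16473, which seems to imply that "[regressors and meta_regressors] do not preserve data frames" (though it refers to sklearn StackingRegressor instances rather than mlxtend ones).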
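The mechanism can be sketched in isolation. The snippet below assumes that the stacking regressor passes its input through an array validation step comparable to scikit-learn's check_array before handing it to the inner regressors; whether mlxtend calls exactly this helper is an assumption. The `frame` data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.utils import check_array

frame = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
selector = make_column_selector(dtype_include=np.number)

# On a DataFrame the selector returns the matching column names.
print(selector(frame))

# Validation of this kind returns a plain ndarray ...
validated = check_array(frame)
print(type(validated))

# ... and on an ndarray the selector raises the error from the question.
try:
    selector(validated)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

So the back_to_pandas step does its job, but any conversion to an array that happens afterwards, inside the stacking regressor, undoes it before the inner ColumnTransformer runs.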

Indeed, have a look at what happens once you replace the StackingRegressor with your sr_linear pipeline:

from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

from mlxtend.regressor import StackingRegressor

import numpy as np
import pandas as pd

# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame

# Small data preprocessing:
for column in d.columns:
    if d[column].dtype == object or column == 'MSSubClass':
        d[column] = pd.Categorical(d[column])
    
d.drop(columns='Id', inplace=True)

# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]

# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
             make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
             make_column_selector(dtype_include='category')),
        ('numerical',
             StandardScaler(),
             make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    # `types` (defined further below) must exist before fit() is called
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
    )),
    ('mdl', sr_linear)
])

stacked_regressor.fit(X_train, y_train)

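Note that I had to slightly modify the 'back_to_pandas' step because, for some reason, pd.DataFrame was changing the dtypes of all columns to 'object' (from 'category' and 'float64'), which conflicts with the dtype-based column selection performed in sr_linear. To fix this, I applied .astype(types) to the result of the pd.DataFrame constructor, where types is defined as follows (the key names follow the implementation of the .get_feature_names_out() method of SimpleImputer in the dev version of sklearn):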

types = {} 
for col in d.columns[:-1]: 
    if d[col].dtype == 'category':
        types['categorical__' + col] = str(d[col].dtype)
    else:
        types['numerical__' + col] = str(d[col].dtype)

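The correct thing to do was:

  1. Move from mlxtend's to sklearn's StackingRegressor. I believe the former was created when sklearn still did not have a stacking regressor of its own. Now there is no need to use more 'obscure' solutions; sklearn's stacking regressor works pretty well.
  2. Move the one-hot encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.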
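The second point can be checked on a toy example. The data below is made up, and the sketch only shows that a tree fitted on raw string categories fails while the same tree fitted on one-hot encoded features works.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): a single categorical feature.
X_toy = pd.DataFrame({'colour': pd.Categorical(['red', 'blue', 'red', 'blue'])})
y_toy = [1.0, 2.0, 1.0, 2.0]

# The tree tries to convert its input to floats and fails on strings.
try:
    DecisionTreeRegressor().fit(X_toy, y_toy)
    raw_fit_failed = False
except (ValueError, TypeError) as err:
    raw_fit_failed = True
    print(err)

# After one-hot encoding, the same model fits without complaints.
encoded = OneHotEncoder().fit_transform(X_toy)
tree = DecisionTreeRegressor().fit(encoded, y_toy)
print(tree.predict(encoded))
```

Since both base regressors need the encoded features, doing the encoding once in the outer pipeline also avoids repeating it inside each regressor.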

A working version of the code is given below:

from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor

import numpy as np
import pandas as pd

def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        if df[column].dtype == object or 'MSSubClass' in column:
            df[column] = pd.Categorical(df[column])

    return df

d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')

sr_linear = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('model', LinearRegression())
])

ct_preprocessing = ColumnTransformer(transformers=[
    ('categorical',
        make_pipeline(
            SimpleImputer(strategy='constant', fill_value='None'),
            # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
            OneHotEncoder(sparse=False, handle_unknown='ignore')
        ),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacking_regressor = Pipeline(steps=[
    ('preprocessing', ct_preprocessing),
    ('model', StackingRegressor(
        estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
        final_estimator=DecisionTreeRegressor(),
        passthrough=True
    ))
])

label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

stacking_regressor.fit(X_train, y_train)
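For a quick check without downloading the dataset, the same structure can be exercised on synthetic data. This is a sketch, not the code above: the `toy` frame, its columns, and the random target are made up, and `sparse_threshold=0` on the ColumnTransformer is used here to force dense output instead of the encoder's `sparse` argument.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import StackingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical) with missing values in both column types.
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    'colour': pd.Categorical(rng.choice(['red', 'blue', 'green'], size=n)),
    'size': rng.normal(size=n),
})
toy.loc[::7, 'size'] = np.nan
toy.loc[::11, 'colour'] = np.nan
target = rng.normal(size=n)

# Imputation and one-hot encoding both happen before the stacking step.
preprocessing = ColumnTransformer(
    transformers=[
        ('categorical',
            make_pipeline(
                SimpleImputer(strategy='constant', fill_value='None'),
                OneHotEncoder(handle_unknown='ignore')),
            make_column_selector(dtype_include='category')),
        ('numerical',
            SimpleImputer(strategy='median'),
            make_column_selector(dtype_include=np.number)),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', StackingRegressor(
        estimators=[
            ('linear_regression',
                make_pipeline(StandardScaler(), LinearRegression())),
            ('regression_tree', DecisionTreeRegressor(random_state=0)),
        ],
        final_estimator=DecisionTreeRegressor(random_state=0),
        passthrough=True,
    )),
])

model.fit(toy, target)
predictions = model.predict(toy)
print(predictions.shape)
```

The target here is pure noise, so the predictions say nothing about model quality; the point is only that the pipeline fits and predicts on a frame containing NaN and categorical columns.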

Thanks to user amiola for putting me on the right track.