sklearn pipelines: ColumnTransformer doesn't execute steps sequentially and pipeline doesn't keep feature names

Question

I have the following (hypothetical) dataset:

numerical_ok	numerical_missing	categorical
210	30	cat1
180	NaN	cat2
70	19	cat3

where categorical is a string column and numerical_ok and numerical_missing are both numeric columns, but the latter has some missing values. I want to perform these three tasks:

OneHotEncode the categorical column
Impute the missing values in numerical_missing using SimpleImputer or another sklearn imputer
Discretize the imputed numerical_missing using KBinsDiscretizer

Of course, this is easy to do with a mixed pandas/sklearn approach:

df["numerical_missing"] = SimpleImputer().fit_transform(df[["numerical_missing"]])

ColumnTransformer([
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)

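But for scalability and consistency reasons (I will fit a model to this data later), I would like to see how this works with pipelines. I tried two approaches: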

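Approach 1: Use a single ColumnTransformer.

But it seems to execute all steps jointly rather than one after another, so the data is still missing when KBinsDiscretizer runs: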

ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)

This gives the following error:

KBinsDiscretizer does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

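Approach 2: Combine two ColumnTransformers in a Pipeline.

Now the output of the first step is a sparse array rather than a DataFrame, so I can no longer access the original feature names.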

Pipeline([
    ("transformer", (
        ColumnTransformer([
            ("imputer", SimpleImputer(), ["numerical_missing"]),
            ("encoder", OneHotEncoder(), ["categorical"])
        ], remainder="passthrough")
    )),
    ("discretizer", ColumnTransformer([("discretizer", KBinsDiscretizer(), ["numerical_missing"])]))
]).fit_transform(df)

This gives me:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

How to proceed? Thanks.

Answer 1

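In addition to the question linked in the comments and its linked questions (I generally recommend defining one pipeline for each combination of transformers that you want to run in sequence, and then running those pipelines in parallel inside a single ColumnTransformer), for this small example I would also like to suggest an index-based solution for your second attempt.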
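To see why the first attempt fails, it may help to check that every transformer in a ColumnTransformer receives the original input columns, not the output of the transformers listed before it. The following is only a small sketch that rebuilds the toy data from the question; the transformer name raw_copy is hypothetical and exists purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data from the question.
df = pd.DataFrame({
    "numerical_ok": [210, 180, 70],
    "numerical_missing": [30, np.nan, 19],
    "categorical": ["cat1", "cat2", "cat3"],
})

# Two transformers applied to the same column. "raw_copy" is a
# hypothetical name; "passthrough" forwards the column unchanged.
ct = ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("raw_copy", "passthrough", ["numerical_missing"]),
])
out = ct.fit_transform(df)

# Column 0 holds the imputed values; column 1 still contains the NaN,
# because the second transformer saw the original data.
print(out)
```

Anything that needs the imputed values therefore has to be chained after the imputer in a Pipeline, either around the ColumnTransformer or inside it.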

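The output of a ColumnTransformer contains the columns in the order of its transformers, with the remainder columns at the end. So in your case, the output will be the now-imputed numerical_missing, followed by some unknown number of one-hot-encoded columns, and then the remaining numerical_ok. Since you only want to bin the (imputed) numerical_missing, you can tell the discretizer column transformer to operate on column 0 of its input: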

Pipeline([
    ("transformer", (
        ColumnTransformer([
            ("imputer", SimpleImputer(), ["numerical_missing"]),
            ("encoder", OneHotEncoder(), ["categorical"])
        ], remainder="passthrough")
    )),
    ("discretizer", ColumnTransformer([("discretizer", KBinsDiscretizer(), [0])]))
]).fit_transform(df)
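The column order can be checked on the toy data. This is a sketch under a few assumptions of mine that are not in the snippet above: sparse_threshold=0 forces a dense array so it can be inspected easily, KBinsDiscretizer uses n_bins=2, encode="ordinal" and strategy="uniform" to keep the tiny example small and deterministic, and the second ColumnTransformer gets remainder="passthrough" so that the one-hot and numerical_ok columns are kept (with the default remainder="drop", only the binned column would be returned).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Toy data from the question.
df = pd.DataFrame({
    "numerical_ok": [210, 180, 70],
    "numerical_missing": [30, np.nan, 19],
    "categorical": ["cat1", "cat2", "cat3"],
})

# First stage: impute and one-hot encode; numerical_ok passes through.
# sparse_threshold=0 (an assumption) always returns a dense array.
first_stage = ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough", sparse_threshold=0)

stage_one = first_stage.fit_transform(df)
# Column 0: imputed numerical_missing, columns 1-3: one-hot columns,
# column 4: numerical_ok.
print(stage_one)

# Second stage: bin column 0 by position and keep everything else.
# The KBinsDiscretizer settings are assumptions for this small example.
second_stage = ColumnTransformer([
    ("discretizer",
     KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform"),
     [0]),
], remainder="passthrough")

pipe = Pipeline([
    ("transformer", first_stage),
    ("discretizer", second_stage),
])
result = pipe.fit_transform(df)
print(result.shape)
```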

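I prefer working with column names, so the single-ColumnTransformer-with-separate-pipelines approach is probably still preferable, but this isn't a bad solution either.

OK, I figured I might as well include the approach I keep mentioning.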

num_mis_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("discretizer", KBinsDiscretizer()),
])
ColumnTransformer([
    ("imp_disc", num_mis_pipe, ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough").fit_transform(df)


python

scikit-learn