Sklearn ColumnTransformer + Pipeline = TypeError

Question

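I'm trying to use pipelines and column transformers from sklearn properly, but I always end up with an error. I reproduced it in the following example.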

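import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Data to reproduce the error
X = pd.DataFrame([[1,  2 , 3,  1 ],
                  [1, '?', 2,  0 ],
                  [4,  5 , 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

# SimpleImputer to replace the '?' values with the mode
impute = SimpleImputer(missing_values='?', strategy='most_frequent')

# Simple one-hot encoder
# (in scikit-learn >= 1.2 the argument is named sparse_output instead of sparse)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

col_transfo = ColumnTransformer(transformers=[
    ('missing_vals', impute, ['B', 'D']),
    ('one_hot', ohe, ['A', 'B'])],
    remainder='passthrough'
)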

Then I call the transformer as follows:

col_transfo.fit_transform(X)

which returns the following error:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

Answer 1

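It gives you an error because OneHotEncoder only accepts data of a single type within a column. In your case, column B is a mix of numbers and objects. To solve this, you can split the pipeline between the imputer and the OneHotEncoder, so that you can call astype on the imputer's output. Something like: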

# `impute` and `ohe` are the objects defined in the question
ohe.fit_transform(impute.fit_transform(X[['A','B']]).astype(float))

Answer 2

The error is not coming from the ColumnTransformer but from the OneHotEncoder object. With only the imputer in the ColumnTransformer, it works:

col_transfo = ColumnTransformer(transformers=[
    ('missing_vals', impute, ['B', 'D'])],
    remainder='passthrough'
)

col_transfo.fit_transform(X)

array([[2, 1, 1, 3], [2, 0, 1, 2], [5, 0, 4, 6]], dtype=object)

ohe.fit_transform(X)

TypeError: argument must be a string or number

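OneHotEncoder throws this error because it receives mixed-type values (int + string) to encode in the same column. You need to cast the numeric columns to strings before applying it.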
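A rough sketch of that string cast, reusing the data from the question (the name `X_str` is mine, not the answerer's). Note the trade-off: after the cast, '?' is no longer imputed but is encoded as just another category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

# Cast every value to str so that each column holds a single type.
X_str = X.astype(str)

# Default settings return a sparse matrix, which avoids the
# version-dependent sparse / sparse_output argument.
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(X_str)

print(encoded.shape)      # one output column per distinct value in each input column
print(encoded.toarray())
```

If '?' should be treated as missing rather than as a category, impute first and encode afterwards instead.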

Answer 3

ColumnTransformer applies its transformers in parallel, not sequentially. So the OneHotEncoder sees the un-imputed column B and balks at the mixed types.
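To make the "in parallel" point concrete, here is a small sketch with toy data (the frame `X_demo` and the helpers `add_one` and `times_ten` are made up for illustration). Both transformers are given the same column, and each one receives the original values rather than the other's output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def add_one(df):
    # Hypothetical helper: shift every value by one.
    return df + 1

def times_ten(df):
    # Hypothetical helper: scale every value by ten.
    return df * 10

X_demo = pd.DataFrame({'A': [1, 2, 3]})

ct = ColumnTransformer(transformers=[
    ('plus_one', FunctionTransformer(add_one), ['A']),
    ('times_ten', FunctionTransformer(times_ten), ['A'])])

# Each transformer is applied to the original column A and the
# results are stacked side by side.
out = ct.fit_transform(X_demo)
print(out)
```

The second output column is 10, 20, 30 (computed from the original A), not 20, 30, 40, which is what a sequential application would give. The same thing happens in the question: the encoder is handed the raw column B, '?' included.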

In your case, it's probably fine to just impute on all columns, and then encode A, B:

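from sklearn.pipeline import Pipeline

# The imputer returns a NumPy array, so the column names are gone by the
# time the encoder runs: select A and B by position instead of by name.
encoder = ColumnTransformer(transformers=[
    ('one_hot', ohe, [0, 1])],
    remainder='passthrough'
)
preproc = Pipeline(steps=[
    ('impute', impute),
    ('encode', encoder),
    # optionally, just throw the model here...
])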

If it's important that future missing values in A, C cause errors, then similarly wrap impute into its own ColumnTransformer.
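A minimal sketch of that variant, under a few assumptions of mine: the step names, the positional indices, and the try/except around the `sparse_output` / `sparse` rename are not from the answer. The main thing to watch is column order: a ColumnTransformer puts the transformed columns first and the passthrough columns after them, so the array reaching the encoder is ordered B, D, A, C.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

# The dense-output argument was renamed in scikit-learn 1.2; try the new name first.
try:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
except TypeError:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

# Impute only B and D; A and C pass through untouched.
imputer = ColumnTransformer(transformers=[
    ('missing_vals',
     SimpleImputer(missing_values='?', strategy='most_frequent'),
     ['B', 'D'])],
    remainder='passthrough')

# Output of the step above is ordered B, D, A, C, so A is at
# position 2 and B at position 0.
encoder = ColumnTransformer(transformers=[
    ('one_hot', ohe, [2, 0])],
    remainder='passthrough')

preproc = Pipeline(steps=[
    ('impute', imputer),
    ('encode', encoder),
])

result = preproc.fit_transform(X)
print(result)
```

The result has the one-hot columns for A and B first, followed by the passthrough columns D and C.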

