Why do I get more columns than expected after OneHotEncoding in a Sklearn Pipeline?

Question

I am using a sklearn pipeline to preprocess my data.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

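# Numeric columns: scale, then impute missing values with KNN
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=2, weights='uniform', metric='nan_euclidean', add_indicator=True))
])
# Categorical columns: one-hot encode (in scikit-learn >= 1.2 `sparse` is named `sparse_output`)
categorical_transformer = Pipeline(steps=[
    ('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])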


from sklearn.compose import ColumnTransformer

numeric_features = ['Latitud','Longitud','Habitaciones','Dormitorios','Baños','Superficie_Total','Superficie_cubierta']
categorical_features = ['Tipo_de_propiedad']

# Each entry must be a (name, transformer, columns) 3-tuple
preprocessor = ColumnTransformer(
   transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)])

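The feature Tipo_de_propiedad has 3 classes: 'Departamento', 'Casa', 'PH'. So the 7 other features plus these 3 dummy columns should give me 10 columns after the transformation, but when I apply fit_transform, it returns 14 features.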

train_transfor=pd.DataFrame(preprocessor.fit_transform(X_train))
train_transfor.head()

It works fine when I use pd.get_dummies, but I can't use that inside a Pipeline; OneHotEncoder is useful because I can fit it on the training set and then transform the test set.

dummy=pd.get_dummies(df30[["Tipo_de_propiedad"]])
df_new=pd.concat([df30,dummy],axis=1)
df_new.head()
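To illustrate the fit-on-train, transform-on-test point, here is a minimal sketch. The toy frames and the unseen category "Oficina" are made up for illustration and are not from the asker's data. It shows that a fitted OneHotEncoder keeps the same output columns for new data, whereas pd.get_dummies derives its columns from whatever frame it is given.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "Oficina" is an assumed category that never appears in training
train = pd.DataFrame({"Tipo_de_propiedad": ["Departamento", "Casa", "PH", "Casa"]})
test = pd.DataFrame({"Tipo_de_propiedad": ["PH", "Oficina"]})

# Fit on the training set only; unseen categories become all-zero rows
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is sparse, so convert to a dense array for inspection
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_[0])  # categories learned from the training set
print(test_encoded.shape)      # same number of columns as the training encoding

# get_dummies builds its columns from the test frame alone
test_dummies = pd.get_dummies(test)
print(list(test_dummies.columns))
```

Using the default sparse output and calling .toarray() avoids depending on the name of the dense-output argument, which differs between scikit-learn versions.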

Answer 1

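Your KNNImputer uses the parameter add_indicator=True, so the extra columns are probably missing-value indicators for some of your numeric columns. With add_indicator=True, the imputer appends one binary indicator column for each feature that had missing values during fit. You expected 7 + 3 = 10 columns and got 14, which would be consistent with four of your numeric columns containing missing values in the training set.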
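A minimal sketch of this effect, using a made-up 4x3 array rather than the asker's data: two of the three columns contain NaN, so add_indicator=True should append two indicator columns.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: 3 numeric columns, NaNs only in columns 0 and 2
X = np.array([
    [1.0, 10.0, np.nan],
    [2.0, 20.0, 200.0],
    [np.nan, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

imputer = KNNImputer(n_neighbors=2, add_indicator=True)
with_indicator = imputer.fit_transform(X)
without_indicator = KNNImputer(n_neighbors=2, add_indicator=False).fit_transform(X)

print(with_indicator.shape)         # imputed columns plus indicator columns
print(without_indicator.shape)      # imputed columns only
print(imputer.indicator_.features_) # indices of the columns that got an indicator
```

If the indicator columns are not wanted, setting add_indicator=False in the question's KNNImputer should bring the output back to the expected 10 columns, assuming nothing else in the pipeline adds features.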
