Scikit-learn ColumnTransformer does not return feature names

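I am trying to use ColumnTransformer together with OneHotEncoder to transform my categorical data:

A quick look at my data:

I want to one-hot encode 3 features: 'sex', 'smoker', and 'region', so I use scikit-learn's ColumnTransformer. (I don't want to split the numerical and categorical columns and transform them separately; I just want to run the transformation on a single dataset.)

My code: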


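from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# names of the categorical (object-dtype) columns;
# ColumnTransformer expects column names here, not a DataFrame
cat_feature = X.select_dtypes(include='object').columns

enc = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), cat_feature)],
                        remainder='passthrough')

X_transformed = enc.fit_transform(X)  # transformed version of the original data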


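My problem is that X_transformed then loses all the feature names, which makes it hard for me to debug:

So is there a way to keep my column names after this transformation? I want to incorporate this transformer into a pipeline, so I can't use pd.get_dummies. Thanks!!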

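You have to write a custom Transformer that implements the passthrough and supports get_feature_names.

Steps:

  1. Write a custom Transformer that returns the passthrough column names via get_feature_names
  2. Pass the column names to it
  3. Instead of remainder = 'passthrough', use our custom Transformer

Use enc.get_feature_names() to get the list of features (in scikit-learn 1.0 and later, where get_feature_names was deprecated, the method is get_feature_names_out()).

Sample: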

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Note: this sample relies on get_feature_names(), which was removed in scikit-learn 1.2

df = pd.DataFrame({
    'age': [1, 2, 3, 4],
    'sex': ['male', 'female'] * 2,
    'bmi': [1.1, 2.2, 3.3, 4.4],
    'children': [1] * 4,
    'smoker': ['yes', 'no'] * 2
})
# categorical columns get one-hot encoded, everything else is passed through
cat_features = df.select_dtypes(include='object').columns
passthrough_features = [c for c in df.columns if c not in cat_features]

class PassthroughTransformer(BaseEstimator):
    """Returns its input unchanged but remembers the column names."""

    def fit(self, X, y=None):
        self.cols = X.columns
        return self

    def transform(self, X, y=None):
        self.cols = X.columns
        return X.values

    def get_feature_names(self):
        return self.cols

enc = ColumnTransformer([('1hot', OneHotEncoder(), cat_features),
                         ('pass', PassthroughTransformer(), passthrough_features)])
X_transformed = enc.fit_transform(df)
pd.DataFrame(X_transformed, columns=enc.get_feature_names())

Output:

   1hot__x0_female  1hot__x0_male  1hot__x1_no  1hot__x1_yes  pass__age  pass__bmi  pass__children
0              0.0            1.0          0.0           1.0        1.0        1.1             1.0
1              1.0            0.0          1.0           0.0        2.0        2.2             1.0
2              0.0            1.0          0.0           1.0        3.0        3.3             1.0
3              1.0            0.0          1.0           0.0        4.0        4.4             1.0

Here is an example, hope it helps (to answer your question, I use get_feature_names from OneHotEncoder):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder


d = {
    'Sex': ['female', 'male', 'male'],
    'BMI': [27 , 33 , 31 ],
    'REG': ['south', 'west', 'south']
}

X = pd.DataFrame(d)
print(X)

cat_feature = X.select_dtypes(include='object')  # select only the categorical columns

enc = OneHotEncoder(handle_unknown='ignore')

# fit_transform returns a sparse matrix; convert it to a dense array first
X_transformed = pd.DataFrame(enc.fit_transform(cat_feature).toarray())

# label the encoded columns (get_feature_names() was removed in scikit-learn 1.2)
X_transformed.columns = enc.get_feature_names()

Output of print(X):

      Sex  BMI    REG
0  female   27  south
1    male   33   west
2    male   31  south