Scikit-learn ColumnTransformer does not return feature names
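I am trying to use ColumnTransformer together with OneHotEncoder to transform my categorical data:
A quick look at my data:
I want to one-hot encode 3 features: 'sex', 'smoker', 'region', so I use scikit-learn's ColumnTransformer. (I don't want to split the numerical and categorical columns and transform them separately; I just want to run everything on a single dataset.)
My code: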
cat_feature = X.select_dtypes(include = 'object')  # select only categorical columns
enc = ColumnTransformer([ ('one_hot_encoder' , OneHotEncoder() , cat_feature ) ] ,
                        remainder = 'passthrough')
X_transformed = enc.fit_transform(X)  # transformed version of original data
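My problem is that X_transformed then has all of its feature names stripped, which makes it hard for me to debug:
So is there a way to keep my column names after this transformation? I want to incorporate this transformer into a pipeline, so I can't use pd.get_dummies.
Thanks!!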
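You have to write a custom Transformer that implements the passthrough and supports get_feature_names.
Steps:
- The custom Transformer returns the column names through get_feature_names
- Instead of remainder = 'passthrough', use our custom Transformer
Use enc.get_feature_names() to get the feature list.
Sample: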
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'age': [1,2,3,4],
    'sex': ['male', 'female']*2,
    'bmi': [1.1,2.2,3.3,4.4],
    'children': [1]*4,
    'smoker': ['yes', 'no']*2
})
cat_features = df.select_dtypes(include = 'object').columns
passthrough_features = [c for c in df.columns if c not in cat_features]

class PassthroughTransformer(BaseEstimator):
    # Passes its columns through unchanged, but remembers their names
    def fit(self, X, y = None):
        self.cols = X.columns
        return self
    def transform(self, X, y = None):
        self.cols = X.columns
        return X.values
    def get_feature_names(self):
        return self.cols

enc = ColumnTransformer([ ('1hot' , OneHotEncoder() , cat_features ),
                          ('pass' , PassthroughTransformer(), passthrough_features)])
X_transformed = enc.fit_transform(df)
# get_feature_names() is the API of older scikit-learn releases;
# it was removed in 1.2 in favor of get_feature_names_out()
pd.DataFrame(X_transformed, columns=enc.get_feature_names())
Output:
1hot__x0_female 1hot__x0_male 1hot__x1_no 1hot__x1_yes pass__age pass__bmi pass__children
0 0.0 1.0 0.0 1.0 1.0 1.1 1.0
1 1.0 0.0 1.0 0.0 2.0 2.2 1.0
2 0.0 1.0 0.0 1.0 3.0 3.3 1.0
3 1.0 0.0 1.0 0.0 4.0 4.4 1.0
Here is an example, hope it helps:
(To answer your question, I use get_feature_names from OneHotEncoder.)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

d = {
    'Sex': ['female', 'male', 'male'],
    'BMI': [27 , 33 , 31 ],
    'REG': ['south', 'west', 'south']
}
X = pd.DataFrame(d)
print(X)
cat_feature = X.select_dtypes(include = 'object')  # select only categorical columns
enc = OneHotEncoder(handle_unknown='ignore')
# fit_transform returns a sparse matrix, so convert it before building the DataFrame
X_transformed = enc.fit_transform(cat_feature).toarray().tolist()
X_transformed = pd.DataFrame(X_transformed)
# get_feature_names() exists in older scikit-learn; 1.2 and later use get_feature_names_out()
X_transformed.columns = enc.get_feature_names()
Sex BMI REG
0 female 27 south
1 male 33 west
2 male 31 south