sklearn: Can't make OneHotEncoder work with Pipeline

Question

I am building a pipeline for a model using ColumnTransformer. This is what my pipeline looks like:

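import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder,MinMaxScaler
from sklearn.impute import KNNImputer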

imputer_transformer = ColumnTransformer([
    ('knn_imputer',KNNImputer(n_neighbors=5),[0,3,4,6,7])
],remainder='passthrough')

category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler",MinMaxScaler(),[0,6]),
    ("owner_ordinal_enc",OrdinalEncoder(categories=[['fourth','third','second','first']],handle_unknown='ignore',dtype=np.int16),[3]),
    ("brand_location_ohe",OneHotEncoder(sparse=False,handle_unknown='ignore'),[2,5]),
],remainder='passthrough')


def build_pipeline_with_estimator(estimator):
    return Pipeline([
    ('imputer',imputer_transformer),
    ('category_transformer',category_transformer),
    ('estimator',estimator),
])

This is what my dataset looks like:

kms_driven  owner  location  mileage  power  brand          engine  age
34000.0     first  other     NaN      12.0   Yamaha         150.0   9
28000.0     first  other     72.0     7.0    Hero           100.0   16
5947.0      first  other     53.0     19.0   Bajaj          NaN     4
11000.0     first  delhi     40.0     19.8   Royal Enfield  350.0   7
13568.0     first  delhi     63.0     14.0   Suzuki         150.0   5

This is how I use LinearRegression with the pipeline:

linear_regressor = build_pipeline_with_estimator(LinearRegression())

linear_regressor.fit(X_train,y_train)

print('Linear Regression Train Performance.\n')
print(model_perf(linear_regressor,X_train,y_train))

print('Linear Regression Test Performance.\n')
print(model_perf(linear_regressor,X_test,y_test))

Now, whenever I try to fit the linear regression pipeline, I get this error:

ValueError: could not convert string to float: 'bangalore'

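'bangalore' is one of the values in the location feature, which I am trying to one-hot encode, but it fails and I can't figure out what is going wrong here. Any help would be appreciated.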

Answer 1

After passing through the imputer, the columns that were not imputed are moved to the right, as stated in the notes of the documentation:

Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

We can first try the imputer on its own:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

imputer_transformer = ColumnTransformer([
    ('knn_imputer',KNNImputer(n_neighbors=5),[0,3,4,6,7])
],remainder='passthrough')

We can try it on some example data, and you will see that your categorical columns have now shifted to the right:

X_train = pd.DataFrame({
    'kms':[0,1,2],'owner':['first','first','second'],
    'location':['other','other','delhi'],'mileage':[9,8,np.nan],
    'power':[3,2,1],'brand':['A','B','C'],'engine':[10,100,1000],'age':[3,4,5],
})

imputer_transformer.fit_transform(X_train)
Out[25]: 
array([[0.0, 9.0, 3.0, 10.0, 3.0, 'first', 'other', 'A'],
       [1.0, 8.0, 2.0, 100.0, 4.0, 'first', 'other', 'B'],
       [2.0, 8.5, 1.0, 1000.0, 5.0, 'second', 'delhi', 'C']], dtype=object)
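
The new positions can also be computed instead of read off the array. Below is a minimal sketch, assuming a single transformer with remainder='passthrough' that emits exactly one output column per selected column (the case for KNNImputer here, as long as no selected column is entirely NaN). passthrough_positions is a hypothetical helper written for illustration, not part of scikit-learn:

```python
def passthrough_positions(selected, n_columns):
    """Map original column index -> index after the transformer.

    Assumes one transformer, remainder='passthrough', and one output
    column per selected column (an assumption, see the text above).
    """
    # Transformed columns come first, in the order they were listed.
    remainder = [i for i in range(n_columns) if i not in selected]
    # Passthrough columns are appended on the right, in original order.
    new_order = list(selected) + remainder
    return {old: new for new, old in enumerate(new_order)}


# The imputer above selects columns [0, 3, 4, 6, 7] out of 8.
positions = passthrough_positions([0, 3, 4, 6, 7], n_columns=8)
print(positions[6])  # engine: original index 6
print(positions[1])  # owner: original index 1
print(positions[2], positions[5])  # location and brand
```

For the example data this gives 3 for engine, 5 for owner, and 6 and 7 for location and brand.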

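In your case, you can see that the engine column is now at index 3 (the fourth column), the ordinal owner column is at index 5, and the last two columns (indices 6 and 7) are the categorical ones, so a simple solution might be: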

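# Note: recent scikit-learn releases accept only 'error' or 'use_encoded_value'
# for OrdinalEncoder's handle_unknown, and OneHotEncoder's `sparse` argument
# was renamed `sparse_output` in 1.2; adjust these for your installed version.
category_transformer = ColumnTransformer([
    # after imputation, kms is at index 0 and engine at index 3
    ("kms_driven_engine_min_max_scaler",MinMaxScaler(),[0,3]),
    # owner is now at index 5
    ("owner_ordinal_enc",OrdinalEncoder(categories=[['fourth','third','second','first']],
                                        handle_unknown='ignore',dtype=np.int16),[5]),
    # location and brand are now the last two columns
    ("brand_location_ohe",OneHotEncoder(sparse=False,handle_unknown='ignore'),[6,7]),
],remainder='passthrough')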

y_train = [7,3,2]

linear_regressor = build_pipeline_with_estimator(LinearRegression())

linear_regressor.fit(X_train,y_train)

python

scikit-learn

one-hot-encoding