sklearn: Can't make OneHotEncoder work with Pipeline
I am building a pipeline for my model using ColumnTransformer. This is what my pipeline looks like:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler

# Step 1: impute the numeric columns and pass the remaining columns through.
imputer_transformer = ColumnTransformer([
    ('knn_imputer', KNNImputer(n_neighbors=5), [0, 3, 4, 6, 7])
], remainder='passthrough')

# Step 2: scale and encode, selecting columns by position.
category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 6]),
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']], handle_unknown='ignore', dtype=np.int16), [3]),
    ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [2, 5]),
], remainder='passthrough')

def build_pipeline_with_estimator(estimator):
    return Pipeline([
        ('imputer', imputer_transformer),
        ('category_transformer', category_transformer),
        ('estimator', estimator),
    ])
This is what my dataset looks like:
kms_driven  owner  location  mileage  power  brand          engine  age
34000.0     first  other     NaN      12.0   Yamaha         150.0   9
28000.0     first  other     72.0     7.0    Hero           100.0   16
5947.0      first  other     53.0     19.0   Bajaj          NaN     4
11000.0     first  delhi     40.0     19.8   Royal Enfield  350.0   7
13568.0     first  delhi     63.0     14.0   Suzuki         150.0   5
This is how I use LinearRegression with the pipeline:
linear_regressor = build_pipeline_with_estimator(LinearRegression())
linear_regressor.fit(X_train,y_train)
print('Linear Regression Train Performance.\n')
print(model_perf(linear_regressor,X_train,y_train))
print('Linear Regression Test Performance.\n')
print(model_perf(linear_regressor,X_test,y_test))
Now, whenever I try to fit the pipeline with linear regression, I get this error:
ValueError: could not convert string to float: 'bangalore'
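'bangalore' is one of the values in the location feature, which I am trying to one-hot encode, but the encoding fails and I can't figure out what is going wrong here. Any help would be appreciated.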
After the imputation step, the columns that were not imputed are moved to the right, as described in the note in the documentation:
Columns of the original feature matrix that are not specified are
dropped from the resulting transformed feature matrix, unless
specified in the passthrough keyword. Those columns specified with
passthrough are added at the right to the output of the transformers.
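The reordering rule quoted above can be sketched in plain Python. This is only an illustration, not scikit-learn code: `passthrough_order` is a hypothetical helper, and it assumes a single transformer that returns exactly one output column per selected input column (as the KNN imputer does here).

```python
def passthrough_order(selected, n_columns):
    """Original column indices in the order they appear in the output of a
    single-transformer ColumnTransformer with remainder='passthrough'.

    Hypothetical helper: assumes one output column per selected input column.
    """
    # Columns not selected keep their relative order and are appended on the right.
    remainder = [i for i in range(n_columns) if i not in selected]
    return list(selected) + remainder


# Columns sent to the imputer in the question, out of 8 columns in total.
order = passthrough_order([0, 3, 4, 6, 7], n_columns=8)

# Map each original position to its position after the imputation step.
new_position = {old: new for new, old in enumerate(order)}

print(order)            # [0, 3, 4, 6, 7, 1, 2, 5]
print(new_position[6])  # engine: 3
print(new_position[1])  # owner: 5
print(new_position[2])  # location: 6
print(new_position[5])  # brand: 7
```

Any index handed to a later step has to be looked up in `new_position` rather than taken from the original DataFrame.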
We can first try the imputer on its own:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

imputer_transformer = ColumnTransformer([
    ('knn_imputer', KNNImputer(n_neighbors=5), [0, 3, 4, 6, 7])
], remainder='passthrough')
Trying it on some example data, you can see that your categorical columns have now shifted to the right:
X_train = pd.DataFrame({'kms': [0, 1, 2], 'owner': ['first', 'first', 'second'],
                        'location': ['other', 'other', 'delhi'], 'mileage': [9, 8, np.nan],
                        'power': [3, 2, 1], 'brand': ['A', 'B', 'C'],
                        'engine': [10, 100, 1000], 'age': [3, 4, 5]})
imputer_transformer.fit_transform(X_train)
Out[25]:
array([[0.0, 9.0, 3.0, 10.0, 3.0, 'first', 'other', 'A'],
       [1.0, 8.0, 2.0, 100.0, 4.0, 'first', 'other', 'B'],
       [2.0, 8.5, 1.0, 1000.0, 5.0, 'second', 'delhi', 'C']], dtype=object)
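In your case, you can see that the engine column is now at index 3, your ordinal column (owner) is at index 5, and the last two columns (indices 6 and 7) are the categorical ones, so a simple fix could be: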
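# Indices below refer to the column order after the imputation step.
# Note: newer scikit-learn releases document only 'error' and
# 'use_encoded_value' for OrdinalEncoder.handle_unknown, and from
# scikit-learn 1.2 on OneHotEncoder's `sparse` argument is named `sparse_output`.
category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 3]),
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']],
                                         handle_unknown='ignore', dtype=np.int16), [5]),
    ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [6, 7]),
], remainder='passthrough')

# build_pipeline_with_estimator is the function defined in the question.
y_train = [7, 3, 2]
linear_regressor = build_pipeline_with_estimator(LinearRegression())
linear_regressor.fit(X_train, y_train)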
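As a possible alternative to recomputing indices, the shift can be avoided altogether by doing all preprocessing in a single ColumnTransformer that selects columns by name. The sketch below is not the original setup: it assumes the input is a pandas DataFrame, it scales every numeric column rather than only kms and engine, it leaves the encoders at their default `handle_unknown` and sparse settings except where shown, and the names `X_demo`, `y_demo`, `numeric_cols`, `preprocess` and `model` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Same toy data as in the example above.
X_demo = pd.DataFrame({
    'kms': [0, 1, 2],
    'owner': ['first', 'first', 'second'],
    'location': ['other', 'other', 'delhi'],
    'mileage': [9, 8, np.nan],
    'power': [3, 2, 1],
    'brand': ['A', 'B', 'C'],
    'engine': [10, 100, 1000],
    'age': [3, 4, 5],
})
y_demo = [7, 3, 2]

numeric_cols = ['kms', 'mileage', 'power', 'engine', 'age']

# Imputation and scaling are chained inside one branch, so no other
# branch ever sees a reordered array.
numeric_branch = Pipeline([
    ('impute', KNNImputer(n_neighbors=5)),
    ('scale', MinMaxScaler()),
])

preprocess = ColumnTransformer([
    ('numeric', numeric_branch, numeric_cols),
    ('owner', OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']]), ['owner']),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['location', 'brand']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('estimator', LinearRegression()),
])

model.fit(X_demo, y_demo)
preds = model.predict(X_demo)
n_features = model.named_steps['preprocess'].transform(X_demo).shape[1]
```

Because every branch reads from the original DataFrame by column name, adding or removing an imputed column does not change which columns the encoders receive.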