ColumnTransformer fails with CountVectorizer/HashingVectorizer in a pipeline (multiple text features)
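Similar to this question, I want to use a ColumnTransformer in a pipeline to apply CountVectorizer/HashingVectorizer to the columns with text features. However, I have several text features rather than just one. If I pass a single feature (as a plain string, not a list, as suggested in the solution to the other question) it works fine. How do I do this for multiple text features?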
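from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['x0', 'x1', 'y0', 'y1']
categorical_features = []
text_features = ['text_feature', 'another_text_feature']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('encoder', OneHotEncoder())])
text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
    # This is the failing line: a list of text columns is passed to one vectorizer.
    ('text', text_transformer, text_features)
])

steps = [('preprocessor', preprocessor),
         ('clf', SGDClassifier())]
pipeline = Pipeline(steps=steps)
# X_train and y_train are the asker's data (not shown in the question).
pipeline.fit(X_train, y_train)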
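For context on why the list form misbehaves, here is a minimal sketch on a made-up two-column DataFrame (the data and the name `df` are illustrative assumptions, not taken from the question). ColumnTransformer hands the transformer a 1-D Series when the column is given as a single string and a 2-D DataFrame when it is given as a list, while the text vectorizers expect a 1-D iterable of documents. The sketch calls the vectorizer directly on both kinds of selection to show the difference.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Toy data, assumed for illustration only.
df = pd.DataFrame({
    "text_feature": ["red apple", "green pear", "blue plum"],
    "another_text_feature": ["sweet", "sour", "tart"],
})

# HashingVectorizer is stateless, so transform() can be called without fit().
vectorizer = HashingVectorizer(n_features=16)

# A single column name gives a 1-D Series: one document per row.
as_series = vectorizer.transform(df["text_feature"])

# A list of names gives a 2-D DataFrame. Iterating a DataFrame yields its
# column names, so the vectorizer sees two "documents" instead of three rows.
as_frame = vectorizer.transform(df[["text_feature", "another_text_feature"]])

print(as_series.shape)  # (3, 16)
print(as_frame.shape)   # (2, 16)
```

The row count of the second result no longer matches the data, which is consistent with the pipeline failing once the blocks are stacked side by side.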
Just use a separate transformer for each text feature.
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
    ('text', text_transformer, 'text_feature'),
    ('more_text', text_transformer, 'another_text_feature'),
])
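(Transformers are cloned during fitting, so you will end up with two separate copies of text_transformer and everything works fine. If you are uneasy about specifying the same transformer twice like this, you can always copy/clone it manually before passing it to the ColumnTransformer.)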
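As a rough end-to-end sketch of the approach above, the following runs the per-column setup on a tiny invented dataset. The data, the reduced `n_features=32`, the `random_state`, and the explicit `clone` call are assumptions made for illustration; the categorical step is left out because the question's `categorical_features` list is empty.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data, assumed for illustration only.
X_train = pd.DataFrame({
    "x0": [0.1, 0.9, 0.2, 0.8],
    "x1": [1.0, 0.0, 1.0, 0.0],
    "text_feature": ["red apple", "green pear", "red cherry", "green lime"],
    "another_text_feature": ["sweet fruit", "sour fruit", "sweet snack", "sour snack"],
})
y_train = [0, 1, 0, 1]

text_transformer = Pipeline(steps=[("hashing", HashingVectorizer(n_features=32))])

preprocessor = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), ["x0", "x1"]),
    # Each text column is named by a plain string, so the vectorizer gets 1-D input.
    ("text", text_transformer, "text_feature"),
    # Explicit clone is optional here; it just makes the separate copy visible.
    ("more_text", clone(text_transformer), "another_text_feature"),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", SGDClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

fitted = pipeline.named_steps["preprocessor"].named_transformers_
features = pipeline.named_steps["preprocessor"].transform(X_train)

print(fitted["text"] is fitted["more_text"])  # False: two independent copies
print(features.shape)                         # (4, 66): 2 numeric + 32 + 32 hashed
```

With more text columns, the same pattern extends by adding one `(name, transformer, 'column')` entry per column, each with a unique name.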