Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly

Question

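I am learning about sklearn custom transformers and have read about the two core ways to create one:

by setting up a custom class that inherits from BaseEstimator and TransformerMixin, or
by creating a transform function and passing it to FunctionTransformer.

I wanted to compare the two approaches by implementing a "meta-vectorizer": a vectorizer that supports either CountVectorizer or TfidfVectorizer and transforms the input data according to the specified vectorizer type.

However, I can't get either of them to work when I pass them to sklearn.pipeline.Pipeline. I get the following error message in the fit_transform() step: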

ValueError: all the input array dimensions for the concatenation axis must match 
exactly, but along dimension 0, the array at index 0 has size 6 and the array 
at index 1 has size 1

My code for option 1 (using a custom class):

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        return self 
    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer': \
[CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)

My code for option 2 (creating a custom transformer from a function using FunctionTransformer):

def vectorize_text(X, vectorizer: Callable):
    X_vect_ = vectorizer.fit_transform(X)
    return X_vect_.toarray()

vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__kw_args': \
    [{'vectorizer':CountVectorizer()}, {'vectorizer': TfidfVectorizer()}]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)

Imports and sample data:

import pandas as pd 
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame([
    ['A99', 'hi i love python very much', 'c', 1],
    ['B07', 'which programming language should i learn', 'b', 0],
    ['A12', 'what is the difference between python django flask', 'b', 1],
    ['A21', 'i want to be a programmer one day', 'c', 0],
    ['B11', 'should i learn java or python', 'b', 1],
    ['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])

Notes:

As advised in this question, I convert all sparse matrices to dense arrays after vectorization, as you can see in both cases: X_vect_.toarray().

Answer 1

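The problem is that both CountVectorizer and TfidfVectorizer require their input to be 1D (not 2D). For such cases, the ColumnTransformer documentation states that the columns element of each transformers tuple should be passed as a string rather than as a list.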

columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
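To see where the "size 1" in the error message comes from, here is a minimal sketch (the three-row frame is made up for illustration). Iterating over a DataFrame yields its column labels, so a vectorizer that receives the 2D selection df[['Text']] treats the single label 'Text' as its only document, while the 1D selection df['Text'] gives one row per comment.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, assumed for illustration only
df = pd.DataFrame({'Text': ['hi i love python',
                            'should i learn java or python',
                            'i want to be a programmer']})

# 2D selection (what ['Text'] passes on): the vectorizer iterates over
# the column labels, so it sees exactly one "document", the string 'Text'
shape_2d = CountVectorizer().fit_transform(df[['Text']]).shape

# 1D selection (what 'Text' passes on): one document per row
shape_1d = CountVectorizer().fit_transform(df['Text']).shape

print(shape_2d)  # (1, 1)
print(shape_1d)  # 3 rows, one per comment
```

The one-row result cannot be stacked next to the OneHotEncoder output, which has one row per sample, hence the concatenation error.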

Therefore, the following will work in your case (i.e. changing ['Text'] to 'Text').

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        return self 
    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), 'Text')], remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)
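One caveat about the class above: transform() calls fit_transform(), so the vocabulary is relearned on whatever data is passed in, and a validation fold can end up with different columns than the training fold. A possible variant is sketched below under some assumptions: the class name FittedVectorizer is hypothetical, the unused ngram_range parameter is left out, and X_train/y_train are taken to be the sample df from the question. It learns the vocabulary in fit() and only applies it in transform().

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class FittedVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of Vectorizer that learns its vocabulary in fit()."""

    def __init__(self, vectorizer=None):
        # Store the parameter untouched; None stands for CountVectorizer()
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        base = CountVectorizer() if self.vectorizer is None else self.vectorizer
        # Fit a fresh copy so the constructor argument is never mutated
        self.vectorizer_ = clone(base).fit(X)
        return self

    def transform(self, X):
        # Reuse the training vocabulary: same columns for every input
        return self.vectorizer_.transform(X)


df = pd.DataFrame([
    ['A99', 'hi i love python very much', 'c', 1],
    ['B07', 'which programming language should i learn', 'b', 0],
    ['A12', 'what is the difference between python django flask', 'b', 1],
    ['A21', 'i want to be a programmer one day', 'c', 0],
    ['B11', 'should i learn java or python', 'b', 1],
    ['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])

# Assumption: the whole sample serves as training data
X_train, y_train = df[['Text', 'Type']], df['Target']

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', FittedVectorizer(), 'Text')], remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer':
              [CountVectorizer(), TfidfVectorizer()]}

# error_score='raise' surfaces any failed fit instead of hiding it
search = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',
                      error_score='raise').fit(X_train, y_train)
print(search.cv_results_['mean_test_score'])
```

Because the fitted vectorizer returns a sparse matrix and both ColumnTransformer and LogisticRegression accept sparse input, the .toarray() call is not needed in this sketch.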

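You can adapt the FunctionTransformer example in the same way (pass 'Text' instead of ['Text']). As a final remark, note that I had to pass handle_unknown='ignore' to OneHotEncoder to avoid errors when categories that were not seen during training show up in the test phase of cross-validation.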
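The effect of handle_unknown='ignore' can be seen in isolation with a small sketch. The toy train and test frames are assumptions, chosen to mimic a fold whose validation part contains a Type value that the training part lacks.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy folds, assumed for illustration: 'a' never appears in training
train = pd.DataFrame({'Type': ['b', 'c', 'b']})
test = pd.DataFrame({'Type': ['a']})

# Default behaviour (handle_unknown='error'): unseen categories raise
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# With 'ignore', the unseen category is encoded as an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(test).toarray()

print(strict_failed)  # True
print(encoded)        # [[0. 0.]]
```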
