Is it necessary to encode labels when using `TfidfVectorizer`, `CountVectorizer`, etc.?

When working with text data, I understand that the text labels need to be encoded into some numeric representation (e.g. by using `LabelEncoder`, `OneHotEncoder`, etc.).

However, my question is: when you use a feature-extraction class such as `TfidfVectorizer` or `CountVectorizer`, do you need to perform this step explicitly, or do these classes encode the labels for you behind the scenes?

If you do need to encode the labels yourself, can you perform this step within a Pipeline such as the one below,

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier())
    ])

or do you need to encode the labels beforehand, since the pipeline needs to `fit()` and `transform()` the data (not the labels)?

Looking at the term `transform` in the scikit-learn glossary:

In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.

Indeed, almost all transformers transform only the features. `TfidfVectorizer` and `CountVectorizer` are no exception. If in doubt, you can always check the return type of the transform method (for example, the `fit_transform` method of `CountVectorizer`).
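
For instance, a quick check (a minimal sketch with a made-up toy corpus) shows that `fit_transform` only ever sees and returns the feature matrix, never the labels:

    from sklearn.feature_extraction.text import CountVectorizer

    # toy corpus (made-up example data)
    X = ["spam spam ham", "ham and eggs", "more spam"]

    vec = CountVectorizer()
    Xt = vec.fit_transform(X)  # no y involved at all

    # Xt is a scipy sparse matrix of shape (n_samples, n_features):
    # only the features are transformed, labels are untouched
    print(type(Xt), Xt.shape)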

The same holds when you assemble multiple transformers in a pipeline. As stated in its user guide:

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).

To sum up, you usually handle the labels separately, before fitting the estimator/pipeline.
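
As a concrete illustration (a minimal sketch; the documents and label names below are made up), you could encode `y` with `LabelEncoder` outside the pipeline and pass the encoded array to `fit`:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder

    # toy data (made-up example)
    X = ["free money now", "meeting at noon", "win a prize", "lunch tomorrow"]
    y = ["spam", "ham", "spam", "ham"]

    # encode the labels separately, outside the pipeline
    le = LabelEncoder()
    y_enc = le.fit_transform(y)

    # the pipeline only fits/transforms X; the encoded labels are passed
    # straight through to the final classifier
    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier())
    ])
    pipeline.fit(X, y_enc)

    # map numeric predictions back to the original label names
    print(le.inverse_transform(pipeline.predict(["win free money"])))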