Is it necessary to encode labels when using `TfidfVectorizer`, `CountVectorizer`, etc.?

Question

When working with text data, I understand that text labels need to be encoded into some numeric representation (e.g. by using LabelEncoder, OneHotEncoder, etc.).

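However, my question is whether you need to perform this step explicitly when using a feature extraction class (e.g. TfidfVectorizer, CountVectorizer, etc.), or whether these classes encode the labels for you under the hood.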

If you do need to encode the labels separately yourself, can you perform this step inside a Pipeline, such as the one below?

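    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier())
    ])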

Or do you need to encode the labels beforehand, since the pipeline expects to fit() and transform() the data (not the labels)?

Answer 1

Have a look at the term transform in the scikit-learn glossary:

In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.

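In fact, almost all transformers transform only the features, and TfidfVectorizer and CountVectorizer are no exception. If in doubt, you can always check the return type of the transforming method (for example, the fit_transform method of CountVectorizer).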
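To make that check concrete, here is a minimal sketch. The toy corpus and the label values are made up for illustration. It shows that `fit_transform` returns only a transformed feature matrix and leaves the labels as they were.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and labels, invented for this illustration.
docs = ["the cat sat", "the dog barked", "the cat purred"]
labels = ["cat", "dog", "cat"]

vectorizer = CountVectorizer()
# fit_transform accepts y for API consistency, but the vectorizer ignores it.
Xt = vectorizer.fit_transform(docs, labels)

print(type(Xt))                       # a SciPy sparse matrix of token counts
print(Xt.shape)                       # (n_samples, n_vocabulary_terms)
print(sorted(vectorizer.vocabulary_)) # the learned vocabulary
print(labels)                         # still the original strings
```

Only `Xt` comes back from the call; nothing label-related is returned or stored on the vectorizer.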

The same holds when you assemble multiple transformers in a pipeline, as stated in the user guide:

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).
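For contrast, the target-side wrapper mentioned in that passage might be used roughly as follows. This is a sketch on synthetic regression data; the data and the choice of `LinearRegression` are assumptions for illustration and are not part of the question.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): the target grows exponentially with the feature.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.5 * X.ravel())

# The wrapper applies func to y before fitting the regressor and
# inverse_func to the regressor's predictions afterwards.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
pred = model.predict(X)  # predictions are back on the original scale of y
```

Here the transformation of `y` lives in a dedicated wrapper around the estimator, not in a Pipeline step.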

To sum up, you typically handle the labels separately, before fitting the estimator/pipeline.
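A minimal sketch of that workflow, using the pipeline from the question, could look like this. The documents, the labels and the `random_state` value are assumptions made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy training data, invented for this illustration.
docs = [
    "win a free prize now",
    "meeting moved to monday",
    "free cash offer inside",
    "lunch with the team tomorrow",
]
labels = ["spam", "ham", "spam", "ham"]

# Step 1: encode the labels outside the pipeline.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)  # classes are sorted: ham -> 0, spam -> 1

# Step 2: the pipeline transforms only X; y is passed through to the classifier.
pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])
pipeline.fit(docs, y)

# Step 3: map integer predictions back to the original label strings.
pred = pipeline.predict(["free prize offer"])
pred_labels = label_encoder.inverse_transform(pred)
print(pred_labels)
```

As far as I know, scikit-learn classifiers such as SGDClassifier also accept string class labels directly, so the explicit LabelEncoder step matters mainly when you want the integer codes yourself.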

python

machine-learning

python-3.x

scikit-learn

scikit-learn-pipeline