Is it necessary to encode labels when using `TfidfVectorizer`, `CountVectorizer`, etc.?
When working with text data, I understand that text labels need to be encoded into some numeric representation (e.g., by using `LabelEncoder`, `OneHotEncoder`, etc.).
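My question is: when you use a feature extraction class (e.g., `TfidfVectorizer`, `CountVectorizer`, etc.), do you need to perform this step explicitly, or do these classes encode the labels for you behind the scenes?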
If you do need to encode the labels separately yourself, can you perform this step inside a `Pipeline` such as the one below?
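from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),  # text -> TF-IDF feature matrix
    ('sgd', SGDClassifier())       # linear classifier trained with SGD
])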
Or do you need to encode the labels beforehand, since the pipeline expects to `fit()` and `transform()` the data (not the labels)?
Take a look at the term `transform` in the scikit-learn glossary:
In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.
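In fact, almost all transformers transform only the features. This is also true of `TfidfVectorizer` and `CountVectorizer`. If in doubt, you can always check the return type of the transforming function (for example, the `fit_transform` method of `CountVectorizer`).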
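As a quick way to check this yourself, the minimal sketch below calls `fit_transform` on a small made-up corpus (the documents and labels are illustrative assumptions, not from the question). It passes labels as `y` only to show that the returned object is a feature matrix; the labels do not come back encoded.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and labels, invented purely for illustration.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]
labels = ["cat", "dog", "both"]

vectorizer = CountVectorizer()
# y is accepted for API compatibility, but the vectorizer ignores it.
Xt = vectorizer.fit_transform(docs, labels)

print(type(Xt))      # a SciPy sparse matrix type
print(Xt.shape)      # (n_samples, n_vocabulary_terms)
print(labels)        # unchanged: still the original strings
```

Only `X` is transformed here; `labels` is the same list of strings after the call.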
The same holds when you assemble multiple transformers in a pipeline. This is stated in the user guide:
Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).
In summary, you typically handle the labels separately before fitting the estimator/pipeline.
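One way this could look in practice is sketched below, assuming a tiny invented spam/ham dataset and `random_state=0` chosen only for reproducibility. The labels are encoded with `LabelEncoder` outside the pipeline, and the pipeline itself only transforms the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Made-up training data for illustration only.
texts = [
    "win a free prize now",
    "meeting moved to monday",
    "free money click here",
    "lunch at noon tomorrow",
]
labels = ["spam", "ham", "spam", "ham"]

# Encode the labels separately, before fitting the pipeline.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)  # classes are sorted: ham, spam

pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),           # transforms X only
    ("sgd", SGDClassifier(random_state=0)), # receives Xt and y
])
pipeline.fit(texts, y)

# Map numeric predictions back to the original label strings.
pred = pipeline.predict(["free prize money"])
pred_labels = label_encoder.inverse_transform(pred)
print(pred_labels)
```

Keeping the encoder outside the pipeline means you also keep it around to call `inverse_transform` on predictions. Note that scikit-learn classifiers generally accept string class labels directly as well, so for plain classification this explicit step is often optional.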