
What is the difference between Hashing vectorizer and Count vectorizer, when each to be used?

I am experimenting with various SVM variants in scikit-learn together with CountVectorizer and HashingVectorizer. The examples use fit in some places and fit_transform in others, which confuses me.

Any clarification would be much appreciated.

They serve a similar purpose. The documentation for HashingVectorizer lists some pros and cons:

This strategy has several advantages:

  • it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.