What is the difference between HashingVectorizer and CountVectorizer, and when should each be used?
I am trying to use various SVM variants in scikit-learn together with CountVectorizer and HashingVectorizer. Different examples use fit or fit_transform, and this confuses me.
Any clarification would be much appreciated.
They serve a similar purpose. The documentation for HashingVectorizer lists some pros and cons; a short usage sketch follows the quote:
This strategy has several advantages:
- it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
- there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
- there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
- no IDF weighting as this would render the transformer stateful.
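To make the fit/fit_transform question concrete, here is a minimal sketch (the toy documents, labels, and variable names are made up for illustration). CountVectorizer is stateful: it learns a vocabulary from the training text, so you call fit_transform on the training set and then transform on new text. HashingVectorizer is stateless: fit does nothing, so transform can be called directly on any text.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data for illustration only.
train_docs = ["the cat sat on the mat", "dogs chase cats", "the dog barked"]
train_labels = [0, 1, 1]
test_docs = ["a cat and a dog"]

# CountVectorizer: fit_transform learns the vocabulary from the training text;
# transform reuses that same vocabulary on unseen text.
count_vec = CountVectorizer()
X_train_counts = count_vec.fit_transform(train_docs)
X_test_counts = count_vec.transform(test_docs)

# HashingVectorizer: no vocabulary is stored, the hashing trick maps tokens
# to a fixed number of columns, so transform works without any prior fit.
hash_vec = HashingVectorizer(n_features=2**18)
X_train_hashed = hash_vec.transform(train_docs)
X_test_hashed = hash_vec.transform(test_docs)

# Either sparse matrix can be fed to an SVM variant such as LinearSVC.
clf = LinearSVC()
clf.fit(X_train_counts, train_labels)
print(clf.predict(X_test_counts))
```

So the difference you are seeing in the examples is just this: with CountVectorizer the vectorizer itself must be fitted (hence fit_transform on training data), whereas with HashingVectorizer there is nothing to fit, and only the downstream estimator needs a fit call.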