What is the difference between HashingTF and CountVectorizer in Spark?

I am trying to do document classification in Spark. I am not sure what the hashing in HashingTF does; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark documentation says it uses the "hashing trick"... just another example of really bad/confusing naming used by engineers (I'm guilty too). CountVectorizer also requires setting the vocabulary size, but it has another parameter, a threshold parameter, that can be used to exclude words or tokens that appear below that threshold in the text corpus. I do not understand the difference between these two transformers. What makes this important is the subsequent steps in my algorithm. For example, if I want to perform SVD on the resulting tf-idf matrix, then the vocabulary size will determine the size of the matrix for SVD, which affects the running time of the code, the model performance, and so on. In general I have had difficulty finding any source on Spark MLlib beyond the API documentation and really trivial examples with no depth.

The hashing trick is actually just another name for feature hashing.

Quoting Wikipedia's definition:

In machine learning, feature hashing, also known as the hashing trick, by analogy to the kernel trick, is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

You can read more about it in this paper.

So it is really a space-efficient way of vectorizing features.
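To make the trick concrete, here is a minimal plain-Python sketch of feature hashing (illustrative only; Spark's HashingTF uses MurmurHash 3 rather than Python's built-in hash):

# Minimal sketch of the hashing trick: tokens are mapped straight to vector
# indices via a hash function, so no vocabulary has to be built or stored.
def hashing_tf(tokens, num_features=16):
    vector = [0] * num_features
    for token in tokens:
        index = hash(token) % num_features  # direct index computation, no dictionary
        vector[index] += 1                  # colliding tokens simply add up
    return vector

print(hashing_tf(["spark", "is", "fast", "spark"]))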

CountVectorizer, by contrast, performs only vocabulary extraction and transforms the documents into vectors.

A few important differences:

  • Partially reversible (CountVectorizer) vs irreversible (HashingTF) - since hashing is not reversible, you cannot restore the original input from a hashed vector. A count vector together with its model (index), on the other hand, can be used to restore the unordered input. As a consequence, models created from hashed input can be much harder to interpret and monitor.
  • Memory and computational overhead - HashingTF requires only a single data scan and no additional memory beyond the original input and the vector. CountVectorizer requires an additional scan over the data to build the model and additional memory to store the vocabulary (index). In the case of a unigram language model this is usually not a problem, but for higher n-grams it can be prohibitively expensive or infeasible.
  • Hashing depends on the size of the vector, the hash function, and the document. Counting depends on the size of the vector, the training corpus, and the document.
  • A source of information loss - in the case of HashingTF it is dimensionality reduction with possible collisions; CountVectorizer discards infrequent tokens. How this affects downstream models depends on the particular use case and data (see the sketch after this list).
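As a small illustration of the last point, here is a sketch (assuming a SparkSession named spark is available) of how CountVectorizer drops tokens that do not make it into the vocabulary:

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame(
    [(["a", "a", "b", "rare"],), (["a", "b", "b"],)], ["words"])

# vocabSize=2 keeps only the two most frequent terms; "rare" never gets an index,
# so its occurrences are silently lost in the transformed vectors.
cv_model = CountVectorizer(inputCol="words", outputCol="features",
                           vocabSize=2).fit(df)
print(cv_model.vocabulary)               # e.g. ['a', 'b'] -- no entry for 'rare'
cv_model.transform(df).show(truncate=False)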

According to the Spark 2.1.0 documentation,

Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.

HashingTF

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing.

To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the columns. The default feature dimension is 2^18=262,144. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
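A short sketch of the two knobs mentioned above, a power-of-two numFeatures and the optional binary toggle (assuming a SparkSession named spark; the binary parameter is available on pyspark.ml.feature.HashingTF in Spark 2.x):

from pyspark.ml.feature import HashingTF

df = spark.createDataFrame([(["a", "b", "a", "c"],)], ["words"])

# Power-of-two feature dimension, as the docs recommend; binary=True turns
# every nonzero term frequency into 1.0 (presence/absence rather than counts).
htf = HashingTF(inputCol="words", outputCol="features",
                numFeatures=2 ** 8, binary=True)
htf.transform(df).show(truncate=False)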

CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
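And a corresponding sketch for CountVectorizer's vocabSize and minDF parameters (again assuming a SparkSession named spark): with minDF=2.0 a term must occur in at least two documents to enter the vocabulary:

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame(
    [(["a", "b", "c"],), (["a", "b"],), (["a", "only_once"],)], ["words"])

# minDF=2.0 -> keep only terms that appear in at least 2 documents.
cv_model = CountVectorizer(inputCol="words", outputCol="features",
                           vocabSize=10, minDF=2.0).fit(df)
print(cv_model.vocabulary)   # 'c' and 'only_once' are filtered out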

Sample code

from pyspark.sql.functions import col
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import CountVectorizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")],
    ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="Features", numFeatures=100)
hashingTF_model = hashingTF.transform(wordsData)
print("Out of hashingTF function")
hashingTF_model.select('words', col('Features').alias('Features(vocab_size,[index],[tf])')).show(truncate=False)

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="Features", vocabSize=20)

cv_model = cv.fit(wordsData)

cv_result = cv_model.transform(wordsData)
print("Out of CountVectorizer function")
cv_result.select('words', col('Features').alias('Features(vocab_size,[index],[tf])')).show(truncate=False)
print("Vocabulary from CountVectorizerModel is \n" + str(cv_model.vocabulary))

(The output of the show() calls is omitted here.)

HashingTF misses the vocabulary, which is essential for techniques such as LDA. For that you have to use the CountVectorizer function. Also, irrespective of the vocabulary size, CountVectorizer estimates the term frequencies without any approximation, unlike HashingTF.
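To illustrate why the vocabulary matters for LDA, here is a hedged sketch that reuses cv_result and cv_model from the sample code above and maps LDA's term indices back to words through the CountVectorizerModel vocabulary:

from pyspark.ml.clustering import LDA

# Fit LDA on the count vectors produced by CountVectorizer.
lda = LDA(k=2, maxIter=10, featuresCol="Features")
lda_model = lda.fit(cv_result)

# describeTopics() returns term *indices*; the CountVectorizerModel vocabulary
# is what translates them back into actual words. HashingTF offers no such map.
for row in lda_model.describeTopics(3).collect():
    print([cv_model.vocabulary[i] for i in row.termIndices])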

References:

https://spark.apache.org/docs/latest/ml-features.html#tf-idf

https://spark.apache.org/docs/latest/ml-features.html#countvectorizer

The existing answers are great. I just want to emphasize this API difference:

For example:

 (CountVectorizer(inputCol="words", outputCol="features")
      .fit(original_df)
      .transform(original_df))

versus:

 (HashingTF(inputCol="words", outputCol="features")
      .transform(original_df))

The API difference is that CountVectorizer has an extra fit step. Perhaps that is because CountVectorizer does extra work (see the accepted answer):

CountVectorizer requires additional scan over the data to build a model and additional memory to store vocabulary (index).

I think you can also skip the fitting step if you are able to create the CountVectorizerModel directly, as shown in the example:

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvm.transform(df).show(false)
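For completeness, a PySpark sketch of the same idea; I believe newer PySpark versions (2.4+) expose this as CountVectorizerModel.from_vocabulary, but treat that method name as an assumption to verify against your Spark version:

from pyspark.ml.feature import CountVectorizerModel

df = spark.createDataFrame([(["a", "b", "b", "c", "a"],)], ["words"])

# Build the model from an a-priori vocabulary -- no fit() pass over the data.
# NOTE: from_vocabulary is assumed available here (PySpark 2.4+).
cvm = CountVectorizerModel.from_vocabulary(
    ["a", "b", "c"], inputCol="words", outputCol="features")

cvm.transform(df).show(truncate=False)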

One more big difference!

  • HashingTF may create collisions! This means two different features/words are treated as the same term.
  • The accepted answer puts it this way:

    a source of the information loss - in case of HashingTF it is dimensionality reduction with possible collisions

This is especially a problem for explicitly low values of numFeatures (pow(2,4), pow(2,8)); the default value is quite high (2^18 = 262,144, per the docs quoted above). In this example:

from pyspark.ml.feature import HashingTF

wordsData = spark.createDataFrame([([
    'one', 'two', 'three', 'four', 'five',
    'six', 'seven', 'eight', 'nine', 'ten'],)], ['tokens'])
hashing = HashingTF(inputCol="tokens", outputCol="hashedValues", numFeatures=pow(2,4))
hashed_df = hashing.transform(wordsData)
hashed_df.show(truncate=False)


+-----------------------------------------------------------+
|hashedValues                                               |
+-----------------------------------------------------------+
|(16,[0,1,2,6,8,11,12,13],[1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0])|
+-----------------------------------------------------------+

The output contains 16 "hash buckets" (because I used numFeatures=pow(2,4)):

...16...

While my input has 10 unique tokens, the output contains only 8 unique hash indices (due to hash collisions):

....v-------8x-------v....
...[0,1,2,6,8,11,12,13]...

The hash collision means that 3 different tokens were assigned the same hash bucket (even though all the tokens are unique and each should appear only once):

...---------------v
... [1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0] ...

(So keep the default value, or increase your numFeatures to try to avoid collisions; see the sketch after the quote below:)

This [Hashing] approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table.
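A quick sketch of that advice, reusing the wordsData DataFrame from the collision example above: a larger power-of-two numFeatures makes collisions far less likely (though never impossible):

# Same ten tokens, but a much larger hash space: most likely all 10 tokens
# now land in distinct buckets, each with a count of 1.0.
hashing_big = HashingTF(inputCol="tokens", outputCol="hashedValues",
                        numFeatures=pow(2, 10))
hashing_big.transform(wordsData).show(truncate=False)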

Some other API differences

  • The CountVectorizer constructor (i.e. at initialization) supports extra parameters:

    • minDF
    • minTF
    • etc.
  • CountVectorizerModel has a vocabulary member, so you can see the generated vocabulary (especially useful if you fit your CountVectorizer):

    • countVectorizerModel.vocabulary
    • >>> [u'one', u'two', ...]
  • CountVectorizer is "reversible", as the main answer says! Use its vocabulary member, which is an array mapping the term index to the term (see the sketch below).
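A small sketch of that last point, assuming the countVectorizerModel above and the original_df from the earlier snippet: the indices of a sparse count vector can be mapped back to terms through the vocabulary array (order is lost, counts are not):

# Decode one sparse count vector back into an (unordered) bag of words.
row = countVectorizerModel.transform(original_df).first()
vector = row["features"]                   # SparseVector(vocab_size, [indices], [counts])
decoded = {countVectorizerModel.vocabulary[int(i)]: int(c)
           for i, c in zip(vector.indices, vector.values)}
print(decoded)                             # e.g. {'one': 1, 'two': 1, ...}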