二维特征维度推荐推理的Spark Hashing TF power

Spark Hashing TF power of two feature dimension recommendation reasoning

根据https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf：

"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3." ... "Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."

我试图理解为什么使用 2 的幂作为特征维度会均匀地映射单词，并尝试在互联网上找到一些有用的文档来理解它，但两次尝试都没有成功。

是否有人知道或有有用的资源来说明为什么使用二次幂将单词均匀地映射到向量索引？

散列函数的输出是 b 位，即，有 2^b 个可能的值可以用来散列一个特征。此外，我们假设 2^b 个可能值随机均匀出现。

如果d是特征维度，则特征f的索引被确定为hash(f) MOD d。同样，hash(f) 采用 2^b 可能的值。很容易看出 d 本身必须是 2 的幂（即 2^b 的约数）才能保持 均匀性 .

作为反例，考虑一个 2 位哈希函数和一个 3 维特征 space。根据我们的假设，哈希函数输出 0、1、2 或 3，每个概率为 1/4。但是，取 mod 3 结果为 0 的概率为 1/2，或者 1 或 2 的概率为 1/4。因此，不能保持均匀性。另一方面;如果特征 space 是二维的，很容易看出结果将是 0 或 1，每个概率为 1/2。

二维特征维度推荐推理的Spark Hashing TF power

Spark Hashing TF power of two feature dimension recommendation reasoning

tf-idf

apache-spark