How does HashPartitioner work?

Question

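I read the documentation for HashPartitioner. Unfortunately, it explains little beyond the API calls. I assume that HashPartitioner partitions the distributed set based on the hash of the keys. For example, if my data is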

(1,1), (1,2), (1,3), (2,1), (2,2), (2,3)

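then the partitioner would put it into different partitions, with equal keys falling into the same partition. However, I do not understand the significance of the constructor argument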

new HashPartitioner(numPartitions) // What does numPartitions do?

For the dataset above, how would the results differ if I did

new HashPartitioner(1)
new HashPartitioner(2)
new HashPartitioner(10)

So how does HashPartitioner actually work?

Answer 1

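An RDD is distributed, which means it is split into some number of parts. Each of these partitions is potentially on a different machine. A hash partitioner with the argument numPartitions chooses the partition on which to place a pair (key, value) in the following way: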

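It creates exactly numPartitions partitions and places the pair (key, value) in the partition whose index is the hash of key modulo numPartitions.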
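To make the rule concrete for the data in the question, here is a minimal plain-Scala sketch that needs no Spark. The helper `partitionFor` is a hypothetical name, not part of the Spark API, and `nonNegativeMod` is re-implemented here on the assumption that it behaves like the function of the same name mentioned in Answer 2.

```scala
// A sketch of the partition choice, not Spark's actual source.
object PartitionChoice {
  // Map a possibly negative value onto the range 0 until mod.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper: index of the partition a key would land in.
  def partitionFor(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 2) // the distinct keys in the question's data
    for (n <- Seq(1, 2, 10); k <- keys)
      println(s"numPartitions=$n key=$k -> partition ${partitionFor(k, n)}")
  }
}
```

With one partition both keys land in partition 0. With two, key 1 goes to partition 1 and key 2 to partition 0. With ten, keys 1 and 2 go to partitions 1 and 2, and the other eight partitions stay empty.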

Answer 2

Well, let's make your dataset marginally more interesting:

val rdd = sc.parallelize(for {
    x <- 1 to 3
    y <- 1 to 2
} yield (x, None), 8)

We have six elements:

rdd.count

Long = 6

no partitioner:

rdd.partitioner

Option[org.apache.spark.Partitioner] = None

and eight partitions:

rdd.partitions.length

Int = 8

Now let's define a small helper to count the number of elements per partition:

import org.apache.spark.rdd.RDD

def countByPartition(rdd: RDD[(Int, None.type)]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

Since we don't have a partitioner, our dataset is distributed uniformly between partitions:

countByPartition(rdd).collect()

Array[Int] = Array(0, 1, 1, 1, 0, 1, 1, 1)
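The pattern of zeros and ones above can be explained by position-based slicing. The sketch below assumes that parallelize gives slice i the index range from i * length / numSlices up to (i + 1) * length / numSlices. That formula is an assumption about the internals, though it is consistent with the output shown. The names `positions` and `sliceSizes` are illustrative.

```scala
// A sketch of position-based slicing, assuming index ranges are computed this way.
object EvenSlices {
  // Index range [start, end) that slice i would receive.
  def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  // Number of elements in each slice.
  def sliceSizes(length: Long, numSlices: Int): Seq[Int] =
    positions(length, numSlices).map { case (start, end) => end - start }

  def main(args: Array[String]): Unit =
    println(sliceSizes(6, 8)) // six elements spread over eight slices
}
```

For six elements and eight slices this gives the sizes 0, 1, 1, 1, 0, 1, 1, 1, which matches the result of countByPartition. Keys play no role here: only the position of an element in the input collection matters.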

Now let's repartition our dataset:

import org.apache.spark.HashPartitioner
val rddOneP = rdd.partitionBy(new HashPartitioner(1))

Since the parameter passed to HashPartitioner defines the number of partitions, we expect one partition:

rddOneP.partitions.length

Int = 1

Since we have only one partition, it contains all the elements:

countByPartition(rddOneP).collect

Array[Int] = Array(6)

Note that the order of values after the shuffle is non-deterministic.

In the same way, if we use HashPartitioner(2)

val rddTwoP = rdd.partitionBy(new HashPartitioner(2))

we get 2 partitions:

rddTwoP.partitions.length

Int = 2

Since rdd is now partitioned by key, the data is no longer distributed uniformly:

countByPartition(rddTwoP).collect()

Array[Int] = Array(2, 4)

With three keys and only two distinct values of hashCode mod numPartitions, there is nothing unexpected here:

(1 to 3).map((k: Int) => (k, k.hashCode, k.hashCode % 2))

scala.collection.immutable.IndexedSeq[(Int, Int, Int)] = Vector((1,1,1), (2,2,0), (3,3,1))

Just to confirm the above:

rddTwoP.mapPartitions(iter => Iterator(iter.map(_._1).toSet)).collect()

Array[scala.collection.immutable.Set[Int]] = Array(Set(2), Set(1, 3))

Finally, with HashPartitioner(7) we get seven partitions, three of them non-empty with 2 elements each:

val rddSevenP = rdd.partitionBy(new HashPartitioner(7))
rddSevenP.partitions.length

Int = 7

countByPartition(rddSevenP).collect()

Array[Int] = Array(0, 2, 2, 2, 0, 0, 0)
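The per-partition counts from the three runs above can be reproduced without a cluster. This is a minimal sketch on plain Scala collections, not Spark code: `counts` is a hypothetical helper, and `nonNegativeMod` is re-implemented locally on the assumption that it matches what the Scala API does.

```scala
// A Spark-free simulation of how many elements each partition receives.
object CountsPerPartition {
  // Map a possibly negative value onto the range 0 until mod.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Keys of the six pairs built earlier: each of 1, 2, 3 appears twice.
  val keys: Seq[Int] = for { x <- 1 to 3; y <- 1 to 2 } yield x

  // Element count per partition index, with empty partitions included as 0.
  def counts(numPartitions: Int): Vector[Int] = {
    val byPartition = keys.groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
    Vector.tabulate(numPartitions)(i => byPartition.get(i).map(_.size).getOrElse(0))
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(1, 2, 7)) println(s"numPartitions=$n -> ${counts(n)}")
}
```

The simulated counts are 6 for one partition, 2 and 4 for two partitions, and 0, 2, 2, 2, 0, 0, 0 for seven, which agrees with the Spark output.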

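HashPartitioner takes a single argument, which defines the number of partitions.
Values are assigned to partitions using the hash of their keys. The hash function may differ depending on the language (Scala RDDs may use hashCode, Datasets use MurmurHash 3, PySpark uses portable_hash).

In a simple case like this one, where the key is a small integer, you can assume that hash is the identity (i = hash(i)).

The Scala API uses nonNegativeMod to determine the partition from the computed hash.
If the distribution of keys is not uniform, you can end up with part of your cluster sitting idle.
Keys have to be hashable. PySpark has its own specific issues here, which I cover in another answer. Another possible problem is highlighted by the HashPartitioner documentation:

Java arrays have hashCodes that are based on the arrays' identities rather than their contents, so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.

In Python 3 you have to make sure that hashing is consistent.
The hash partitioner is neither injective nor surjective. Multiple keys can be assigned to a single partition, and some partitions can remain empty.
Note that hash-based methods currently don't work in Scala when combined with case classes defined in the REPL.
HashPartitioner (or any other Partitioner) shuffles the data. Unless the partitioning is reused between multiple operations, it doesn't reduce the amount of data to be shuffled.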
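The warning about Java arrays in the notes above can be checked in plain Scala. The sketch below only illustrates the difference between identity-based and content-based hashing. The object and value names are hypothetical, and converting keys with toSeq is one possible workaround, not something the Spark documentation prescribes.

```scala
// Two arrays with equal contents are still distinct objects.
object ArrayKeys {
  val a = Array(1, 2, 3)
  val b = Array(1, 2, 3)

  // Array equality compares references, so this is false.
  val sameReference: Boolean = a == b

  // Sequences compare and hash by content, so equal contents agree.
  val seqKeysAgree: Boolean =
    a.toSeq == b.toSeq && a.toSeq.hashCode == b.toSeq.hashCode

  def main(args: Array[String]): Unit = {
    // Identity-based hash codes: these two values will usually differ.
    println(s"array hashCodes: ${a.hashCode} vs ${b.hashCode}")
    println(s"arrays equal: $sameReference")
    println(s"Seq keys agree: $seqKeysAgree")
  }
}
```

Because the two arrays generally hash differently, pairs whose array keys have equal contents may be sent to different partitions. Keys converted to a content-hashed type such as Seq always map to the same one.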

Answer 3

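The HashPartitioner.getPartition method takes a key as its argument and returns the index of the partition that the key belongs to. The partitioner has to know what the valid indices are, so that it returns numbers in the right range. The number of partitions is specified through the numPartitions constructor argument.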

The implementation returns roughly key.hashCode() % numPartitions. See Partitioner.scala for more details.
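The word "roughly" matters because on the JVM the % operator can return a negative number, which is not a valid partition index. The following sketch compares plain % with a non-negative variant. The local `nonNegativeMod` is assumed to mirror the helper that Answer 2 mentions, and the key -7 is just an illustrative value.

```scala
// Why a plain remainder is not enough for keys with negative hash codes.
object NegativeHash {
  // Map a possibly negative value onto the range 0 until mod.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  val numPartitions = 3
  val key = -7 // for an Int, hashCode is the value itself

  val plainRemainder: Int = key.hashCode % numPartitions          // -1
  val safeIndex: Int = nonNegativeMod(key.hashCode, numPartitions) // 2
  val floorMod: Int = java.lang.Math.floorMod(key.hashCode, numPartitions) // 2

  def main(args: Array[String]): Unit =
    println(s"% gives $plainRemainder, nonNegativeMod gives $safeIndex, floorMod gives $floorMod")
}
```

For positive divisors the adjusted remainder agrees with java.lang.Math.floorMod, so the result always lies between 0 and numPartitions - 1.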
