TotalOrderPartitioner and partition file

Question

I am learning Hadoop MapReduce using the Java API. I learned that TotalOrderPartitioner is used to sort the output by key 'globally' across the cluster, and that it needs a partition file (generated using InputSampler):

job.setPartitionerClass(TotalOrderPartitioner.class);
InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.1, 200);
InputSampler.writePartitionFile(job, sampler);

I have a couple of doubts and would like the community's help:

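What exactly does 'sorted globally' mean here? How is the output sorted, given that we still have multiple output part files distributed across the cluster?
What happens if we don't provide the partition file? Is there a default way to handle this situation?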

Answer 1

Let's explain it with an example. Suppose your partition file looks like this:

H
T
V

If your keys range from A to Z, these three split points define 4 ranges:

1 [A,H)
2 [H,T)
3 [T,V)
4 [V,Z]

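When a mapper sends records to the reducers, the partitioner looks at the key of each output record. Suppose the combined output of all mappers is: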

A,N,C,K,Z,S,U

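The partitioner now checks your partition file and sends each record to the corresponding reducer. Let's assume you have defined 4 reducers, so each reducer handles one range: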

Reducer 1 handles A,C
Reducer 2 handles N,K,S
Reducer 3 handles U
Reducer 4 handles Z
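
To make the routing above concrete, here is a minimal sketch in plain Java. It is not Hadoop's implementation: the class name RangePartitionSketch and its methods are made up for illustration, and String keys with natural ordering stand in for the job's key type and comparator. It looks up each key with a binary search over the split points H, T, V, then sorts the keys inside each reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: mimics range partitioning in plain Java, not Hadoop's actual code.
final class RangePartitionSketch {

    // Returns the 0-based reducer index for a key, given sorted split points.
    // A key equal to a split point belongs to the range that starts at that split point.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found at index i: the key opens range i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point equals the number of split points that are smaller than the key.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Routes every key to a reducer, then sorts the keys held by each reducer.
    static List<List<String>> partitionAndSort(String[] keys, String[] splitPoints) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            reducers.add(new ArrayList<>());
        }
        for (String key : keys) {
            reducers.get(getPartition(key, splitPoints)).add(key);
        }
        for (List<String> reducerKeys : reducers) {
            Collections.sort(reducerKeys); // stands in for the sort each reducer sees
        }
        return reducers;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"H", "T", "V"};
        String[] keys = {"A", "N", "C", "K", "Z", "S", "U"};
        List<List<String>> reducers = partitionAndSort(keys, splitPoints);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("Reducer " + (i + 1) + " handles " + reducers.get(i));
        }
    }
}
```

Each reducer only holds keys from its own range, and the ranges are themselves in order, so reading the reducers' outputs one after another (reducer 1, then 2, then 3, then 4) gives A, C, K, N, S, U, Z. That is what 'sorted globally' amounts to, even though the output is still split across several part files.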

This shows that your partition file must contain n-1 elements, where n is the number of reducers you use. Another important note from the docs:

If the keytype is BinaryComparable and total.order.partitioner.natural.order is not false, a trie of the first total.order.partitioner.max.trie.depth(2) + 1 bytes will be built. Otherwise, keys will be located using a binary search of the partition keyset using the RawComparator defined for this job. The input file must be sorted with the same comparator and contain JobContextImpl.getNumReduceTasks() - 1 keys.
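
The two requirements in the last sentence of that note (sorted keys, and one key fewer than the number of reducers) can be written as a small pre-flight check. This is only a sketch: SplitPointCheck and isValid are hypothetical names, not part of Hadoop, and rejecting duplicate split points is an assumption of this sketch rather than something the quoted text states.

```java
import java.util.Comparator;

// Illustration only: a hypothetical sanity check, not a Hadoop class.
final class SplitPointCheck {

    // Returns true when there are exactly numReduceTasks - 1 split points and they
    // are in strictly increasing order under the given comparator.
    // Assumption of this sketch: equal neighbouring split points count as invalid.
    static <K> boolean isValid(K[] splitPoints, int numReduceTasks, Comparator<? super K> cmp) {
        if (splitPoints.length != numReduceTasks - 1) {
            return false;
        }
        for (int i = 1; i < splitPoints.length; i++) {
            if (cmp.compare(splitPoints[i - 1], splitPoints[i]) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"H", "T", "V"};
        // 3 split points match 4 reducers, but not 5.
        System.out.println(isValid(splitPoints, 4, Comparator.<String>naturalOrder()));
        System.out.println(isValid(splitPoints, 5, Comparator.<String>naturalOrder()));
    }
}
```

The comparator is passed in explicitly because the quoted note requires the partition file to be sorted with the same comparator the job uses for its keys.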

