In MapReduce, how does the "shuffle step" decide where each key should go?

Question

Consider the basic word count example of a MapReduce job, with the following input:

word1
word2
word1
word2
word3

For this job, assume we have three mappers and three reducers.

In the map phase, the data is processed as follows:

MAP1: (word1,1), (word2,1)
MAP2: (word1,1), (word2,1)
MAP3: (word3,1)

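Now the shuffle phase begins. All the word1 keys need to end up together, and likewise the word2 and word3 keys.

The shuffle phase could decide to send word1 to reducer1, word2 to reducer2, and word3 to reducer3, or word1 to reducer2, and so on.

How does it decide which reducer each key is shuffled to?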

Answer 1

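Before the reduce step, Hadoop uses an implementation of Partitioner to determine which reducer each key should be sent to. By default this is HashPartitioner, whose getPartition method is: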

public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
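To make the formula concrete, here is a minimal standalone sketch that applies the same arithmetic to the three keys from the question. It assumes plain Java String keys and a hypothetical class name, HashPartitionDemo. In a real job the keys are usually Hadoop Text objects, whose hashCode differs from String's, so the actual reducer indices may be different.

```java
public class HashPartitionDemo {
    // Same arithmetic as the getPartition method above; the unused value argument is dropped.
    public static int partitionFor(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        String[] keys = {"word1", "word2", "word3"};
        for (String key : keys) {
            // With String keys this prints indices 1, 2 and 0 respectively.
            System.out.println(key + " -> reducer index " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the index depends only on the key's hash and the number of reducers, every mapper computes the same destination for the same key without any coordination, which is what brings all the word1 pairs together on one reducer.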

If your job needs some additional logic, you can plug in a custom implementation:

job.setPartitionerClass(YourPartitioner.class);
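As a rough sketch of what such a custom implementation might look like, the example below routes keys by their first letter. The routing rule is purely illustrative. The nested Partitioner class is a stand-in so that the sketch compiles without Hadoop on the classpath. In a real job, YourPartitioner would instead extend org.apache.hadoop.mapreduce.Partitioner, typically with Text and IntWritable as type parameters for word count.

```java
public class FirstLetterPartitionerSketch {
    // Stand-in for Hadoop's Partitioner base class (an assumption made to keep this self-contained).
    public abstract static class Partitioner<K, V> {
        public abstract int getPartition(K key, V value, int numReduceTasks);
    }

    // Hypothetical custom logic: keys that start with the same letter go to the same reducer.
    public static class YourPartitioner extends Partitioner<String, Integer> {
        @Override
        public int getPartition(String key, Integer value, int numReduceTasks) {
            if (key.isEmpty()) {
                return 0; // send empty keys to the first reducer
            }
            // A char is never negative, so the remainder is a valid index in [0, numReduceTasks).
            return Character.toLowerCase(key.charAt(0)) % numReduceTasks;
        }
    }

    public static void main(String[] args) {
        YourPartitioner partitioner = new YourPartitioner();
        for (String key : new String[] {"word1", "word2", "apple"}) {
            System.out.println(key + " -> reducer index " + partitioner.getPartition(key, 1, 3));
        }
    }
}
```

Whatever rule is used, it must return a value between 0 and numReduceTasks - 1 and must always give the same answer for the same key, otherwise values for one key would be split across reducers.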
