MapReduce output key in ascending order

Question

I have written a MapReduce job in which both the keys and the values are integers. I am using a single reducer. The output looks like this:

Key    Value
1      78
128    12
174    26
2      44
2957   123
975    91

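Is there a way to get the output sorted by key in ascending order, so that it looks like this:

Key    Value
1      78
2      44
128    12
174    26
975    91
2957   123

Do I need to use conf.setComparator? If so, how do I do that?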

Answer 1

Use a TreeMap. It was created for exactly this:

A Red-Black tree based NavigableMap implementation. The map is sorted according to the natural ordering of its keys, or by a Comparator provided at map creation time, depending on which constructor is used.
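A minimal sketch of that idea in plain Java, outside Hadoop. The class and method names are made up for illustration, and it assumes all pairs fit in memory and that keys are unique (a later value for the same key overwrites the earlier one):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; not part of Hadoop or of the original answer.
class TreeMapSortDemo {

    // Returns "key<TAB>value" lines in ascending numeric key order.
    static List<String> sortedLines(int[][] pairs) {
        // TreeMap keeps Integer keys in their natural (numeric) order.
        Map<Integer, Integer> sorted = new TreeMap<>();
        for (int[] pair : pairs) {
            sorted.put(pair[0], pair[1]);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : sorted.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // The pairs from the question, in the order they were printed.
        int[][] pairs = {
            {1, 78}, {128, 12}, {174, 26}, {2, 44}, {2957, 123}, {975, 91}
        };
        sortedLines(pairs).forEach(System.out::println);
    }
}
```

Note that the keys have to be Integer here. With String keys the same TreeMap would order them as text, not as numbers.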

Answer 2

This requires a TotalOrderPartitioner:

https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/partition/TotalOrderPartitioner.html

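It forces an additional stage in the M/R pipeline that partitions the elements into sorted buckets.

The TreeMap solution will not work globally, only within each reducer.

Here is a gist (not mine) showing how to use TotalOrderPartitioner: https://gist.github.com/asimjalis/e5627dc2ff2b23dac70b

The main points of the gist are:

a) You need to call setPartitionerClass on the reduce job and pass it TotalOrderPartitioner: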

  // Use Total Order Partitioner.
  reduceJob.setPartitionerClass(TotalOrderPartitioner.class);

b) You need to generate a set of split points that serve as the "buckets" for the TOP (TotalOrderPartitioner):

  // Generate partition file from map-only job's output.
  TotalOrderPartitioner.setPartitionFile(
      reduceJob.getConfiguration(), partitionPath);
  InputSampler.writePartitionFile(reduceJob, new InputSampler.RandomSampler(
      1, 10000));
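
To illustrate what the split points are for, here is a rough plain-Java sketch (no Hadoop dependency) of the idea behind total-order partitioning: each key is routed to a bucket by comparing it against sorted split points, so every key in bucket i is smaller than every key in bucket i+1. Sorting inside each bucket and reading the buckets in order then gives a globally sorted result. The class name, the method names, and the split points 100 and 1000 are made up for illustration. This is not the code from the gist or from Hadoop itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of range partitioning; not Hadoop's implementation.
class TotalOrderSketch {

    // Picks a bucket for the key. splitPoints must be sorted ascending;
    // n split points define n + 1 buckets.
    static int partitionFor(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        // A key that is not a split point yields a negative value
        // that encodes its insertion point.
        return pos < 0 ? -pos : pos;
    }

    // Routes keys to buckets, sorts each bucket, and concatenates them.
    static List<Integer> globallySorted(int[] keys, int[] splitPoints) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int key : keys) {
            buckets.get(partitionFor(key, splitPoints)).add(key);
        }
        List<Integer> result = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            // Stands in for the per-reducer sort of each partition.
            Collections.sort(bucket);
            result.addAll(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] keys = {1, 128, 174, 2, 2957, 975};
        int[] splitPoints = {100, 1000}; // assumed split points, 3 buckets
        System.out.println(globallySorted(keys, splitPoints));
    }
}
```

In a real job the split points would come from sampling the input, which is what the InputSampler call above is for, rather than being hard-coded.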

Answer 3

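I see three options here:

1. (Preferred) Use a TotalOrderPartitioner, as suggested in another answer (my +1). This is more generic, but requires more effort.
2. Use a single reducer, if you can. This requires that all the data can fit in the memory of a single machine. The input to the single reducer will then be sorted in ascending order of key, which is what you want.
3. After the job finishes, use the hdfs getmerge command and then sort the merged file manually, e.g. with the Linux sort command (or even merge-sort the multiple files, without the getmerge command). After all, you don't have to use MapReduce for everything! Take care to sort by the key only! For example, you can run: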
```
sort -n -k1,1 filename
```
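There are more sort options, though...

As a final note (for completeness), all of the above assumes that you are not using a map-only job, where the output is not sorted. If that is the case, I can only see option 3 working.

UPDATE: For future reference, based on the comments, it seems that the output keys were not of type IntWritable and were therefore not sorted as integers.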
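
A small plain-Java sketch of why that matters, assuming the keys were being compared as text: sorting the question's keys as strings reproduces the order from the question, while sorting them as integers gives the desired order. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class contrasting text order with numeric order.
class KeyOrderDemo {

    // Lexicographic order: compares character by character, so "128" < "2".
    static List<String> sortAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    // Numeric order: parses each key as an int before comparing.
    static List<String> sortAsNumbers(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingInt(Integer::parseInt));
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("2957", "1", "975", "128", "2", "174");
        System.out.println(sortAsText(keys));    // [1, 128, 174, 2, 2957, 975]
        System.out.println(sortAsNumbers(keys)); // [1, 2, 128, 174, 975, 2957]
    }
}
```

So if the job emits its keys with a text key type, switching the map output key class to IntWritable should be enough for a single reducer to receive them in numeric order.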
