When should I use OutputCollector and Context in Hadoop?

In this article, I found this mapper code used for word counting:

  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

In contrast, this is the mapper provided in the official tutorial:

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

So far, I have only ever seen Context used to write things from the mapper to the reducer; I have never seen (or used) OutputCollector. I have read the documentation, but I don't understand the point of using it or why I would want to.

This is a good solution, but I just used a 1-line solution: int wordcount = string.split(" ").length - 1;
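A caveat on that one-liner: split(" ").length - 1 counts single-space separators, not words, so it undercounts by one on a normal sentence and misbehaves with repeated or leading whitespace. A minimal plain-Java sketch comparing it with a whitespace-run split (class and variable names are illustrative):

```java
public class SplitCount {
    public static void main(String[] args) {
        String s = "the quick brown fox";

        // The one-liner from above: counts separators, not words
        int bySeparators = s.split(" ").length - 1;  // 3 for a 4-word string

        // A more robust variant: trim, then split on runs of whitespace
        int words = s.trim().isEmpty() ? 0 : s.trim().split("\\s+").length;  // 4

        System.out.println(bySeparators + " " + words);
    }
}
```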

The two code samples use different MapReduce APIs: OutputCollector belongs to MRv1 and Context belongs to MRv2.

The Java MapReduce API 1, also known as MRv1, was released with the initial Hadoop versions. The flaw in these initial versions was that the MapReduce framework performed both the task of processing and resource management.

MapReduce 2, or Next Generation MapReduce, was a long-awaited and much-needed upgrade to scheduling, resource management, and execution in Hadoop. Fundamentally, the improvements separate cluster resource-management capabilities from MapReduce-specific logic; this separation of processing and resource management was achieved through the introduction of YARN in later versions of Hadoop.

MRv1 uses OutputCollector and Reporter to communicate with the MapReduce system.

The MRv2 API makes extensive use of Context objects, which allow the user code to communicate with the MapReduce system. (The roles of JobConf, OutputCollector, and Reporter from the old API are unified by Context objects in MRv2.)
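To illustrate that unification, here is a hedged sketch of an MRv2 mapper that uses its single Context argument for all three old roles: reading job configuration (formerly JobConf), emitting output (formerly OutputCollector), and incrementing a custom counter (formerly Reporter). The configuration key and counter names here are made up for illustration, and the class only compiles against a Hadoop 2+ installation:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextRolesMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  private int minLength;  // hypothetical setting, for illustration only

  @Override
  protected void setup(Context context) {
    // Role of the old JobConf: read job configuration through the context
    minLength = context.getConfiguration().getInt("wordcount.min.length", 1);
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      if (token.length() < minLength) {
        // Role of the old Reporter: increment a custom counter
        context.getCounter("WordCount", "SKIPPED_SHORT_WORDS").increment(1);
        continue;
      }
      word.set(token);
      // Role of the old OutputCollector: emit a (key, value) pair
      context.write(word, one);
    }
  }
}
```

In MRv1 the same mapper would have needed the JobConf passed to configure(), plus the separate OutputCollector and Reporter arguments to map().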

You should use MapReduce 2 (MRv2). I have highlighted the biggest advantages of Hadoop 2 over Hadoop 1:

  1. One major advantage is that there are no jobtrackers or tasktrackers in the Hadoop 2 architecture; we have the YARN resource manager and node managers instead. This lets Hadoop 2 support frameworks other than MapReduce for executing code, and overcomes the high-latency issues associated with MapReduce.
  2. Hadoop 2 supports both non-batch processing and traditional batch operations.
  3. Hadoop 2 introduces HDFS federation, which allows multiple namenodes to control a Hadoop cluster and addresses Hadoop 1's single-point-of-failure problem.

MRv2 has many more advantages: https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/