Map -> Reduce -> Reduce (two reducers called sequentially) - how to configure the driver program
I need to write a MapReduce program that calls two reducers in sequence, i.e. the output of the first reducer will be the input to the second reducer. How can I achieve this?
What I have found so far suggests that I need to configure two MapReduce jobs in my driver code (code below).
This seems wasteful, for two reasons:
- I don't really need a mapper in the second job
- Having two jobs looks like overkill.
Is there a better way to achieve this?
Also, a question about the approach below: the output of Job1 will be multiple files in the OUTPUT_PATH directory. This directory is passed as the input to Job2; is that OK? Doesn't it have to be a file? Will Job2 process all files under the given directory?
Configuration conf = getConf();
FileSystem fs = FileSystem.get(conf);
Job job = Job.getInstance(conf, "Job1"); // Job.getInstance replaces the deprecated Job(conf, name) constructor
job.setJarByClass(ChainJobs.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReducer1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path(args[0]));
TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
if (!job.waitForCompletion(true)) {
    return 1; /* abort if Job1 fails; Job2 reads Job1's output, so it must not run */
}
/*
* Job 2
*/
Configuration conf2 = getConf();
Job job2 = Job.getInstance(conf2, "Job 2"); // Job.getInstance replaces the deprecated Job(conf, name) constructor
job2.setJarByClass(ChainJobs.class);
job2.setMapperClass(MyMapper2.class);
job2.setReducerClass(MyReducer2.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
TextOutputFormat.setOutputPath(job2, new Path(args[1]));
return job2.waitForCompletion(true) ? 0 : 1;
don't really need a mapper in second job
The framework does, though.
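You can avoid writing MyMapper2 by letting Job2 fall back on the built-in identity Mapper. The catch is the input format: TextInputFormat hands the mapper (LongWritable offset, Text line) records, which is why a custom mapper is normally needed to re-parse Job1's "key<TAB>value" output lines. KeyValueTextInputFormat splits each line on the first tab into (Text key, Text value) instead, so the identity Mapper can forward records straight to MyReducer2. A sketch of the relevant Job2 settings (not from the original post; class names as in the question):

```java
// Assumes the imports:
//   org.apache.hadoop.mapreduce.Mapper
//   org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

// Read Job1's "key<TAB>value" text lines as (Text, Text) records.
job2.setInputFormatClass(KeyValueTextInputFormat.class);

// The base Mapper class passes every record through unchanged; setting it
// explicitly (or omitting setMapperClass entirely) gives the identity map.
job2.setMapperClass(Mapper.class);
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(Text.class);
```

Note that with this approach MyReducer2 receives its values as Text (the string form written by Job1), so it must parse them back if it needs numbers.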
having two jobs looks like overkill... Is there a better way to achieve this?
Then don't use MapReduce... Spark, for example, would likely be faster and take less code.
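For comparison, here is a minimal Spark sketch of the same chain. The actual reducer logic is not shown in the question, so the two stages below are placeholders (assumed: a word count followed by regrouping by count); the point is that both "reduce" steps run in one program with no intermediate files on HDFS:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ChainInSpark {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("chain"));

        // First aggregation (stand-in for MyReducer1): count words.
        JavaPairRDD<String, Integer> counts = sc.textFile(args[0])
            .flatMap(line -> java.util.Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey(Integer::sum);

        // Second aggregation (stand-in for MyReducer2) consumes the first
        // result directly; no TextOutputFormat/TextInputFormat round trip.
        counts
            .mapToPair(t -> new Tuple2<>(t._2, t._1))
            .groupByKey()
            .saveAsTextFile(args[1]);

        sc.close();
    }
}
```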
Will Job2 process all files under the given directory?
Yes. FileInputFormat treats a directory input path as all the files inside it; by default, hidden files (names starting with _ or .), such as Job1's _SUCCESS marker, are filtered out.