Map -> Reduce -> Reduce (two reducers called sequentially) - how to configure the driver program
I need to write a MapReduce program that calls two reducers in sequence, i.e. the output of the first reducer will be the input to the second reducer. How can I achieve this?
What I have found so far suggests that I need to configure two MapReduce jobs in my driver code (code below).
This seems wasteful, for two reasons:
- I don't really need a mapper in the second job
- Having two jobs looks like overkill.
Is there a better way to achieve this?
Also, a question about the approach below: the output of Job1 will be multiple files in the OUTPUT_PATH directory. This directory is passed as the input to Job2; is that OK? Doesn't it have to be a file? Will Job2 process all files under the given directory?
Configuration conf = getConf();
FileSystem fs = FileSystem.get(conf);
Job job = Job.getInstance(conf, "Job1"); // Job.getInstance replaces the deprecated Job(conf, name) constructor
job.setJarByClass(ChainJobs.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReducer1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path(args[0]));
TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
if (!job.waitForCompletion(true)) {
    return 1; /* abort if Job1 fails; Job2 reads Job1's output, so it must not run */
}
/*
* Job 2
*/
Configuration conf2 = getConf();
Job job2 = Job.getInstance(conf2, "Job 2"); // Job.getInstance replaces the deprecated Job(conf, name) constructor
job2.setJarByClass(ChainJobs.class);
job2.setMapperClass(MyMapper2.class);
job2.setReducerClass(MyReducer2.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
TextOutputFormat.setOutputPath(job2, new Path(args[1]));
return job2.waitForCompletion(true) ? 0 : 1;
don't really need a mapper in second job
The framework does, though.
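You can avoid writing MyMapper2 by letting Job2 fall back on the built-in identity Mapper. The catch is the input format: TextInputFormat hands the mapper (LongWritable offset, Text line) records, which is why a custom mapper is normally needed to re-parse Job1's "key<TAB>value" output lines. KeyValueTextInputFormat splits each line on the first tab into (Text key, Text value) instead, so the identity Mapper can forward records straight to MyReducer2. A sketch of the relevant Job2 settings (not from the original post; class names as in the question):

```java
// Assumes the imports:
//   org.apache.hadoop.mapreduce.Mapper
//   org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

// Read Job1's "key<TAB>value" text lines as (Text, Text) records.
job2.setInputFormatClass(KeyValueTextInputFormat.class);

// The base Mapper class passes every record through unchanged; setting it
// explicitly (or omitting setMapperClass entirely) gives the identity map.
job2.setMapperClass(Mapper.class);
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(Text.class);
```

Note that with this approach MyReducer2 receives its values as Text (the string form written by Job1), so it must parse them back if it needs numbers.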
having two jobs looks like overkill... Is there a better way to achieve this?
Then don't use MapReduce... Spark, for example, would likely be faster and take less code.
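For comparison, here is a minimal Spark sketch of the same chain. The actual reducer logic is not shown in the question, so the two stages below are placeholders (assumed: a word count followed by regrouping by count); the point is that both "reduce" steps run in one program with no intermediate files on HDFS:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ChainInSpark {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("chain"));

        // First aggregation (stand-in for MyReducer1): count words.
        JavaPairRDD<String, Integer> counts = sc.textFile(args[0])
            .flatMap(line -> java.util.Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey(Integer::sum);

        // Second aggregation (stand-in for MyReducer2) consumes the first
        // result directly; no TextOutputFormat/TextInputFormat round trip.
        counts
            .mapToPair(t -> new Tuple2<>(t._2, t._1))
            .groupByKey()
            .saveAsTextFile(args[1]);

        sc.close();
    }
}
```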
Will Job2 process all files under the given directory?
Yes. FileInputFormat treats a directory input path as all the files inside it; by default, hidden files (names starting with _ or .), such as Job1's _SUCCESS marker, are filtered out.