Call mapper when reducer is done

Question

I am running this job:

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar  -D mapred.reduce.tasks=2 -file kmeans_mapper.py    -mapper kmeans_mapper.py -file kmeans_reducer.py \
-reducer kmeans_reducer.py -input gutenberg/small_train.csv -output gutenberg/out

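When the two reducers are done, I would like to do something with the results, so ideally I would call another file (another mapper?) that receives the reducers' output as its input. How can I do that easily?

I checked this blog, which has an mrjob example, but it comes with no explanation and I can't see how to apply it to my case.

The MapReduce tutorial states: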

Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.

But it doesn't give any example...

Here is some code in Java that I can follow, but I am writing Python! :/

This question sheds some light: Chaining multiple mapreduce tasks in Hadoop streaming

Answer 1

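It is possible to do what you are asking with the Java API, as the example you found shows.

However, you are using the streaming API, which simply reads from standard input and writes to standard output. There is no callback that tells you when a MapReduce job has finished, other than the hadoop jar command itself returning. And the fact that the command returned does not by itself mean the job succeeded. So this really isn't possible without some additional tooling around the streaming API.

If the output were written to the local terminal rather than to HDFS, you could pipe it into the input of another streaming job. Unfortunately, the inputs and outputs of the streaming jar have to be paths on HDFS.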
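One form such tooling could take is a small Python driver that runs the streaming jobs one after another and feeds the HDFS output directory of the first job to the second as its input. The following is only a sketch under assumptions: the hadoop and jar paths and the k-means script names come from the question, while `postprocess_mapper.py` and `gutenberg/out_post` are hypothetical names for the second step. It treats a job as successful only if the `hadoop jar` command exits with 0 and a `_SUCCESS` marker file exists in the output directory (which Hadoop writes by default when a job commits its output), and it runs the second step as a map-only job by setting the number of reducers to 0.

```python
import subprocess

# Paths taken from the question; adjust them for your own cluster.
HADOOP = "hadoop/bin/hadoop"
STREAMING_JAR = ("/home/hadoopuser/hadoop/share/hadoop/tools/lib/"
                 "hadoop-streaming-2.6.0.jar")


def streaming_cmd(input_path, output_path, mapper, reducer=None, reduces=2):
    """Build the argument list for one Hadoop streaming job.

    Without a reducer the job is map-only (reduce task count set to 0).
    """
    cmd = [HADOOP, "jar", STREAMING_JAR,
           # Generic -D options must come before the streaming options.
           "-D", "mapred.reduce.tasks=%d" % (reduces if reducer else 0),
           "-file", mapper, "-mapper", mapper]
    if reducer:
        cmd += ["-file", reducer, "-reducer", reducer]
    cmd += ["-input", input_path, "-output", output_path]
    return cmd


def run_chain(jobs, run=subprocess.call):
    """Run the jobs in order and stop at the first failure.

    `run` takes an argument list and returns an exit code; it is a
    parameter so the chain can be dry-run without a cluster.
    Returns True only if every job succeeded.
    """
    for job in jobs:
        # The only "callback" is the hadoop command returning.
        if run(streaming_cmd(**job)) != 0:
            return False
        # Assumption: the default _SUCCESS marker is enabled on the cluster.
        marker = job["output_path"] + "/_SUCCESS"
        if run([HADOOP, "fs", "-test", "-e", marker]) != 0:
            return False
    return True


JOBS = [
    # Step 1: the k-means job from the question.
    dict(input_path="gutenberg/small_train.csv", output_path="gutenberg/out",
         mapper="kmeans_mapper.py", reducer="kmeans_reducer.py"),
    # Step 2 (hypothetical): a map-only job reading step 1's output.
    dict(input_path="gutenberg/out", output_path="gutenberg/out_post",
         mapper="postprocess_mapper.py"),
]
```

Calling `run_chain(JOBS)` from a script would launch both jobs in sequence. Each output directory must not already exist on HDFS, since Hadoop refuses to overwrite an existing output path.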

python

hadoop

mapreduce

distributed-computing

cluster-computing