Map-reduce via Oozie

If I use Oozie to run a MapReduce job, is there a definite number of mappers that will be started? Is it:

  1. one for Oozie and one for the map-reduce job, or
  2. one for Oozie, and one mapper per 64 MB block (the default block size)?

Short answer: Oozie starts the MapReduce job by submitting a single map-only job to the cluster, called the Oozie launcher. I agree with @Dennis Jaheruddin.

Detailed answer after my research: Oozie's execution model

Oozie’s execution model is different from the default approach users take to run Hadoop jobs. When a user invokes the Hadoop, Hive, or Pig CLI tool from a Hadoop edge node, the corresponding client executable runs on that node which is configured to contact and submit jobs to the Hadoop cluster. When the same jobs are defined and submitted via an Oozie workflow action, things work differently.

Let’s say you are submitting a workflow job using the Oozie CLI on the edge node. The Oozie client actually submits the workflow to the Oozie server, which typically runs on a different node. Regardless of where it runs, it’s the Oozie server’s responsibility to submit and run the underlying MapReduce jobs on the Hadoop cluster. Oozie doesn’t do so by using the standard client tools installed locally on the Oozie server node. Instead, it first submits a MapReduce job called the “launcher job,” which in turn runs the Hadoop, Hive, or Pig job using the appropriate client APIs.

Important note: The Oozie launcher is basically a map-only job running a single mapper on the Hadoop cluster. This map job knows what to do for the specific action it’s supposed to run and does the appropriate thing by using the libraries for Hadoop, Pig, etc. This will result in other MapReduce jobs being spun up as required. These Oozie jobs are called “asynchronous actions” in Oozie parlance. Oozie doesn’t run these actions in its own server, but kicks them off on the Hadoop cluster using a launcher job. The reason the Oozie server “outsources” the launcher to the Hadoop cluster is to protect itself from unexpected workloads and also to isolate user code from its own services. After all, Oozie has access to an awesome distributed system in the form of a Hadoop cluster.
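To make the launcher model concrete, here is a minimal sketch of a workflow with a single map-reduce action; the mapper/reducer classes, directories, and ${...} variables are placeholders, not values from the question. When this action runs, Oozie first submits the one-mapper launcher job, and the launcher in turn submits the MapReduce job defined below:

```xml
<workflow-app name="mr-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- placeholder mapper/reducer classes (old mapred API) -->
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.MyMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.MyReducer</value>
                </property>
                <!-- placeholder input/output directories -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MR job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```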

For a MapReduce action you can set the number of map tasks, but there is no guarantee it will be honored; it depends on the factors described below.

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

Number of Maps

The number of maps is usually driven by the number of DFS blocks in the input files, although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although we have taken it up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

The number of mappers depends on the number of logical input splits; it does not depend directly on the number of blocks. You can control the number of input splits programmatically.
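As a hedged illustration (assuming a FileInputFormat-based job submitted through an Oozie map-reduce action), the split size, and therefore the mapper count, can also be influenced through configuration properties placed in the action's configuration block; mapreduce.job.maps is only a hint, while the split-size properties actually change how splits are computed. The byte values below are example numbers:

```xml
<!-- inside the <configuration> of the map-reduce action -->
<property>
    <!-- only a hint; most InputFormats ignore it -->
    <name>mapreduce.job.maps</name>
    <value>10</value>
</property>
<property>
    <!-- do not create splits smaller than 256 MB (fewer, larger splits) -->
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>268435456</value>
</property>
<property>
    <!-- do not create splits larger than 512 MB (more, smaller splits) -->
    <name>mapreduce.input.fileinputformat.split.maxsize</name>
    <value>536870912</value>
</property>
```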

For more on how input splits affect the number of mappers and how input splits are created, see https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/

The answers above mainly focus on how many maps and reduces a MapReduce job needs. However, since you ask specifically about Oozie, I will share my experience with MapReduce (in Pig) run via Oozie.

Explanation

When an Oozie workflow is launched, it needs one YARN application of its own. I am not sure exactly what the logic is, but these applications appear to need one map most of the time, and occasionally two.

On top of that, you need the same number of mappers and reducers to do the actual work as you would need without Oozie. (If you see a different number than expected, it may be because you passed specific map or reduce properties when calling your script.)

Warning

The above means that if you have 100 containers available and kick off 100 workflows (for example, a daily job started with a start date 100 days in the past), the workflows are likely to eat up all the available containers, and the actual work will be suspended indefinitely.
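A common mitigation, sketched here under the assumption that your scheduler defines a separate queue (called launcher_queue below), is to route launcher jobs to their own queue so they cannot starve the queue that runs the real work. Properties prefixed with oozie.launcher. in the action's configuration are applied to the launcher job rather than to the actual MapReduce job:

```xml
<!-- inside the action's <configuration>: send the launcher to its own queue -->
<property>
    <name>oozie.launcher.mapred.job.queue.name</name>
    <!-- assumed queue name; it must exist in your scheduler configuration -->
    <value>launcher_queue</value>
</property>
<property>
    <!-- the actual MapReduce job keeps running in the normal queue -->
    <name>mapred.job.queue.name</name>
    <value>default</value>
</property>
```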