使用 Mxnet 的 Hadoop 流作业在 AWS Emr 中失败

Question

我在 AWS 数据管道中设置了一个 emr 步骤。步骤命令如下所示：

/usr/lib/hadoop-mapreduce/hadoop-streaming.jar,-input,s3n://input-bucket/input-file,-output,s3://output/output-dir,-mapper,/bin/cat,-reducer,reducer.py,-file,/scripts/reducer.py,-file,/params/parameters.bin

我收到以下错误

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
    at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
    at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

我已经在我的桌面上单独尝试了运行 reducer 步骤（在单节点 hadoop 设置上）及其工作。我已经在 reducer 脚本中包含了 #!/usr/bin/env python。 ~~我怀疑我没有正确编写 EMR 步骤。~~

EMR version: 5.5.0

编辑： 经过进一步调查，我找到了 emr 中 reducer 代码失败的确切代码行。我正在使用 mxnet library in the reducer. When I load the model parameters, the reducer fails. Reference to API doc is here

进行机器学习预测

module.load_params('parameters.bin')

我已经检查了 EMR 节点的当前工作目录 [使用 os.listdir(os.getcwd())]，它包含 parameters.bin 文件（我什至成功打印了文件内容）。我想再次指出，流作业在我的单节点本地设置上运行良好。

EDIT2: 我将 reducer 任务的数量设置为 2。我将我的 reducer 代码包含在一个 try-except 块中，我在其中一个任务中看到以下错误 (另一个运行良好）

[10:27:25] src/ndarray/ndarray.cc:299: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (119,) to.shape=(111,) 

Stack trace returned 10 entries:    
[bt] (0) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc72fc) [0x7f81443842fc]    
[bt] (1) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc166f4) [0x7f8144ed36f4]   
[bt] (2) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc74c24) [0x7f8144f31c24]   
[bt] (3) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(MXImperativeInvoke+0x2cd) [0x7f8144db935d]    
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8150b8acec]  
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7f8150b8a615]    
[bt] (6) /usr/lib64/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x30b) [0x7f8150d9d97b]   
[bt] (7) /usr/lib64/python2.7/lib-dynload/_ctypes.so(+0xa915) [0x7f8150d97915]  
[bt] (8) /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f815a69e183]    
[bt] (9) /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x337d) [0x7f815a73107d]

Answer 1

我想通了。实际上，mxnet 期望的形状取决于数据集（它实际上取决于数据集中的最大值）。训练发生在单个 gpu 盒子上并具有整个数据集。然而，该预测适用于单节点设置，因为它具有训练中使用的所有数据。但是当使用多节点集群时，数据集会被拆分，这使得每个节点的最大值都不同。这是导致错误的原因。

我现在已经使预期的形状独立于数据集并且不再发生此错误。我希望这能澄清事情。

使用 Mxnet 的 Hadoop 流作业在 AWS Emr 中失败

Hadoop streaming job using Mxnet failing in AWS Emr

hadoop

emr

hadoop-streaming

amazon-data-pipeline

mxnet