Creating EMR using Boto fails
I am trying to create an EMR cluster from Python using the boto library. I have tried a few things, but the end result is always "Shut down as step failed". I also tried running the word-count sample code that Amazon provides, and it still fails.
When I check the logs, I see that EMR cannot find the location of the mapper:
s3n://elasticmapreduce/samples/wordcount/wordSplitter.py": error=2, No such file or directory
That led me to this reply from Amazon, which I found on some site:
Hello,
Beginning with AMI 3.x with Hadoop 2 and moving forward, EMR Hadoop streaming will support the standard Hadoop style reference for streaming jobs. This means that s3 referenced mappers and reducers will need to be placed into the "-files" argument.
For example,
elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate
becomes:
elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --mapper wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate
Now I want to see whether this works for me, but I can't figure out how to set the -files flag and the arguments mentioned there.
Here is my current code:
self._steps.append(StreamingStep(
    name=step_description,
    mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
    reducer='aggregate',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test'))

conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
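(For completeness: in this snippet `conn` is just a boto EMR connection and `StreamingStep` comes from boto's EMR module. A minimal sketch of that setup, where the region name is only an example and credentials are assumed to come from the environment or ~/.boto:)

import boto.emr
from boto.emr.step import StreamingStep

# boto 2 style connection; the region here is just an example,
# credentials are picked up from the environment or ~/.boto
conn = boto.emr.connect_to_region('us-east-1')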
-------------EDIT-------------
It looks like this was the answer: I lowered ami_version to 2.4.11, the last release in the AMI 2.x line (before the Hadoop 2 AMIs), and the same code now works.
I don't really know whether I actually need the latest Hadoop version, probably not, but it bothers me that I am not running the latest version Amazon provides.
----------------EDIT 2----------------
Found the solution:
# Create a list and insert two elements:
# the first element is the argument name '-files',
# the second is the full path to both the mapper and the reducer, separated by a comma.
# If you try to put it all in a single element it fails...
step_args = list()
step_args.append('-files')
step_args.append('s3://<map_full_path>/<map_script_name>,s3://<reduce_full_path>/<reduce_script_name>')

# Add step_args to the StreamingStep arguments
self._steps.append(StreamingStep(
    name=step_description,
    mapper='<map_script_name>',
    reducer='<reduce_script_name>',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test',
    step_args=step_args))

conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
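To tie this back to the wordcount sample from the Amazon reply, here is roughly what the fixed step would look like for that case. This is only a sketch: the output bucket name is a placeholder, and as far as I can tell boto passes step_args to hadoop-streaming ahead of the -mapper/-reducer arguments, which is what the quoted CLI example does with --arg.

from boto.emr.step import StreamingStep

# The mapper is shipped to the cluster via -files and then referenced by
# its bare file name; 'aggregate' is the built-in streaming reducer.
# 'mybucket' below is a placeholder.
wordcount_step = StreamingStep(
    name='wordcount with -files',
    mapper='wordSplitter.py',
    reducer='aggregate',
    input='s3://elasticmapreduce/samples/wordcount/input',
    output='s3://mybucket/wordcount/output',
    step_args=['-files',
               's3://elasticmapreduce/samples/wordcount/wordSplitter.py'])

The resulting step then goes into the list handed to conn.run_jobflow(), exactly as above.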
Hope this helps someone...