Creating EMR using Boto fails
I am trying to create an EMR cluster from Python using the boto library. I have tried a few things, but the end result is always "Shut down as step failed". I also tried running the word-count sample code that Amazon provides, and it still fails.
When I check the logs, I see that EMR cannot find the location of the mapper:
s3n://elasticmapreduce/samples/wordcount/wordSplitter.py": error=2, No such file or directory
That led me to this reply from Amazon, which I found on some site:
Hello,
Beginning with AMI 3.x with Hadoop 2 and moving forward, EMR Hadoop streaming will support the standard Hadoop style reference for streaming jobs. This means that s3 referenced mappers and reducers will need to be placed into the "-files" argument.
For example,
elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate
becomes:
elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --mapper wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate
Now I want to see whether this works for me, but I can't figure out how to set the -files flag and the arguments mentioned there.
Here is my current code:
self._steps.append(StreamingStep(
    name=step_description,
    mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
    reducer='aggregate',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test'))

conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
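(For completeness: in this snippet `conn` is just a boto EMR connection and `StreamingStep` comes from boto's EMR module. A minimal sketch of that setup, where the region name is only an example and credentials are assumed to come from the environment or ~/.boto:)

import boto.emr
from boto.emr.step import StreamingStep

# boto 2 style connection; the region here is just an example,
# credentials are picked up from the environment or ~/.boto
conn = boto.emr.connect_to_region('us-east-1')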
-------------EDIT-------------
It looks like this was the answer: I lowered ami_version to 2.4.11, the last release in the AMI 2.x line (before the Hadoop 2 AMIs), and the same code now works.
I don't really know whether I actually need the latest Hadoop version, probably not, but it bothers me that I am not running the latest version Amazon provides.
----------------EDIT 2----------------
Found the solution:
# Create a list and insert two elements:
# the first element is the argument name '-files',
# the second is the full path to both the mapper and the reducer, separated by a comma.
# If you try to put it all in a single element it fails...
step_args = list()
step_args.append('-files')
step_args.append('s3://<map_full_path>/<map_script_name>,s3://<reduce_full_path>/<reduce_script_name>')

# Add step_args to the StreamingStep arguments
self._steps.append(StreamingStep(
    name=step_description,
    mapper='<map_script_name>',
    reducer='<reduce_script_name>',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test',
    step_args=step_args))

conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
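To tie this back to the wordcount sample from the Amazon reply, here is roughly what the fixed step would look like for that case. This is only a sketch: the output bucket name is a placeholder, and as far as I can tell boto passes step_args to hadoop-streaming ahead of the -mapper/-reducer arguments, which is what the quoted CLI example does with --arg.

from boto.emr.step import StreamingStep

# The mapper is shipped to the cluster via -files and then referenced by
# its bare file name; 'aggregate' is the built-in streaming reducer.
# 'mybucket' below is a placeholder.
wordcount_step = StreamingStep(
    name='wordcount with -files',
    mapper='wordSplitter.py',
    reducer='aggregate',
    input='s3://elasticmapreduce/samples/wordcount/input',
    output='s3://mybucket/wordcount/output',
    step_args=['-files',
               's3://elasticmapreduce/samples/wordcount/wordSplitter.py'])

The resulting step then goes into the list handed to conn.run_jobflow(), exactly as above.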
Hope this helps someone...