How can I make a PySpark script running on Amazon EMR recognize the boto3 module? It says the module is not found
Spark version 2.4.5
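I have files in an S3 bucket that need to be processed (s3a://tobeprocessed).
I have a PySpark application that reads the files from that S3 bucket and writes its output to another S3 bucket (s3://processed).
I intend to run it as a step on my EMR cluster.
I used the following command from the terminal to add a step to my cluster.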
aws emr add-steps --cluster-id j-xxxxxx --steps Name=etlapp,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,s3://bucketname/spark_app.py,s3://bucketname/configuration_file.cfg],ActionOnFailure=CONTINUE
I get an error message like this:
STDERR
20/03/10 19:50:46 INFO RMProxy: Connecting to ResourceManager at ip-172-31-27-34.ec2.internal/172.31.27.34:8032
20/03/10 19:50:47 INFO Client: Requesting a new application from cluster with 2 NodeManagers
20/03/10 19:50:47 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
20/03/10 19:50:47 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/03/10 19:50:47 INFO Client: Setting up container launch context for our AM
20/03/10 19:50:47 INFO Client: Setting up the launch environment for our AM container
20/03/10 19:50:47 INFO Client: Preparing resources for our AM container
20/03/10 19:50:47 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/03/10 19:50:49 INFO Client: Uploading resource file:/mnt/tmp/spark-4c4ea7ac-b2bb-4a61-929d-c371d87417ff/__spark_libs__2224504543987850085.zip -> hdfs://ip-172-31-27-34.ec2.internal:8020/user/hadoop/.sparkStaging/application_1583867709817_0003/__spark_libs__2224504543987850085.zip
20/03/10 19:50:50 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
20/03/10 19:50:50 INFO Client: Uploading resource s3://imdbetlapp/complete_etl.py -> hdfs://ip-172-31-27-34.ec2.internal:8020/user/hadoop/.sparkStaging/application_1583867709817_0003/complete_etl.py
20/03/10 19:50:51 INFO S3NativeFileSystem: Opening 's3://imdbetlapp/complete_etl.py' for reading
20/03/10 19:50:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-172-31-27-34.ec2.internal:8020/user/hadoop/.sparkStaging/application_1583867709817_0003/pyspark.zip
20/03/10 19:50:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://ip-172-31-27-34.ec2.internal:8020/user/hadoop/.sparkStaging/application_1583867709817_0003/py4j-0.10.7-src.zip
20/03/10 19:50:52 INFO Client: Uploading resource file:/mnt/tmp/spark-4c4ea7ac-b2bb-4a61-929d-c371d87417ff/__spark_conf__476112427502500805.zip -> hdfs://ip-172-31-27-34.ec2.internal:8020/user/hadoop/.sparkStaging/application_1583867709817_0003/__spark_conf__.zip
20/03/10 19:50:52 INFO SecurityManager: Changing view acls to: hadoop
20/03/10 19:50:52 INFO SecurityManager: Changing modify acls to: hadoop
20/03/10 19:50:52 INFO SecurityManager: Changing view acls groups to:
20/03/10 19:50:52 INFO SecurityManager: Changing modify acls groups to:
20/03/10 19:50:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/03/10 19:50:53 INFO Client: Submitting application application_1583867709817_0003 to ResourceManager
20/03/10 19:50:53 INFO YarnClientImpl: Submitted application application_1583867709817_0003
20/03/10 19:50:54 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:50:54 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1583869853550
final status: UNDEFINED
tracking URL: http://ip-172-31-27-34.ec2.internal:20888/proxy/application_1583867709817_0003/
user: hadoop
20/03/10 19:50:55 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:50:56 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:50:57 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:50:58 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:50:59 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:51:00 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:51:01 INFO Client: Application report for application_1583867709817_0003 (state: ACCEPTED)
20/03/10 19:51:02 INFO Client: Application report for application_1583867709817_0003 (state: FAILED)
20/03/10 19:51:02 INFO Client:
client token: N/A
diagnostics: Application application_1583867709817_0003 failed 2 times due to AM Container for appattempt_1583867709817_0003_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1583867709817_0003_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-27-34.ec2.internal:8088/cluster/app/application_1583867709817_0003 Then click on links to logs of each attempt.
. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1583869853550
final status: FAILED
tracking URL: http://ip-172-31-27-34.ec2.internal:8088/cluster/app/application_1583867709817_0003
user: hadoop
20/03/10 19:51:02 ERROR Client: Application diagnostics message: Application application_1583867709817_0003 failed 2 times due to AM Container for appattempt_1583867709817_0003_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1583867709817_0003_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-27-34.ec2.internal:8088/cluster/app/application_1583867709817_0003 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1583867709817_0003 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/03/10 19:51:02 INFO ShutdownHookManager: Shutdown hook called
20/03/10 19:51:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-4c4ea7ac-b2bb-4a61-929d-c371d87417ff
20/03/10 19:51:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-a72b6dba-91bb-46b0-b2c3-893ac3b8581f
Command exiting with ret '1'
When I dug into the container logs, I found that the script fails with the following error:
boto3 module not found
I run a bootstrap script that installs boto3 using pip.
I even logged in to the master node and confirmed with the command pip list that boto3 is installed.
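On the console home page, click "Create cluster" and a page appears. At the top, there is a "go to advanced options" link. There you will find the option to "Auto terminate" the cluster "After last step completes".
You do have to use a bootstrap script that installs boto3, but you must be very specific about the Python version being used: install it with the pip that belongs to the Python interpreter your PySpark job runs under, for example: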
sudo pip-3.6 install boto3
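To confirm which interpreter the job actually runs under, a small diagnostic along the following lines could be placed at the top of the PySpark script (or run on its own on a node). This is only a sketch: the helper name describe_module is hypothetical, and it assumes nothing beyond the standard library. It reports the interpreter path, the Python version, and whether the named module can be imported from that interpreter.

```python
import sys


def describe_module(name):
    """Report the running interpreter and whether `name` can be imported from it."""
    try:
        module = __import__(name)
    except ImportError:
        # Raised when the module is not installed for *this* interpreter.
        module = None
    return {
        "interpreter": sys.executable,
        "python_version": "%d.%d" % sys.version_info[:2],
        "module": name,
        "found": module is not None,
        "location": getattr(module, "__file__", None),
    }


if __name__ == "__main__":
    # In cluster deploy mode this line ends up in the driver container's stdout log.
    print(describe_module("boto3"))
```

If the reported interpreter is not the one whose pip installed boto3, that mismatch would explain why pip list shows the package on the master node while the job still cannot import it.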