ExitCodeException exitCode=13 when running PySpark via EMR console

I'm trying to run a PySpark script on EMR through the console. To develop it, I first tested the script locally: I downloaded a small sample CSV from S3 to my machine and used spark-submit to write the aggregated results back to a local folder. Now I need to run the same script on an EMR cluster, because I have to do this at a much larger scale.

So far I have tried everything I could find on Stack Overflow and other forums, but I cannot get rid of the following error:

19/11/18 18:40:07 INFO RMProxy: Connecting to ResourceManager at ip-10-101-30-101.ec2.internal/10.101.30.101:8032
19/11/18 18:40:07 INFO Client: Requesting a new application from cluster with 3 NodeManagers
19/11/18 18:40:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
19/11/18 18:40:07 INFO Client: Will allocate AM container, with 12288 MB memory including 1117 MB overhead
19/11/18 18:40:07 INFO Client: Setting up container launch context for our AM
19/11/18 18:40:07 INFO Client: Setting up the launch environment for our AM container
19/11/18 18:40:07 INFO Client: Preparing resources for our AM container
19/11/18 18:40:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/11/18 18:40:09 INFO Client: Uploading resource file:/mnt/tmp/spark-c251bf55-4c00-485a-8947-617394cc3bb4/__spark_libs__4633570638919089381.zip -> hdfs://ip-10-101-30-101.ec2.internal:8020/user/hadoop/.sparkStaging/application_1574102290151_0001/__spark_libs__4633570638919089381.zip
19/11/18 18:40:10 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-101-30-101.ec2.internal:8020/user/hadoop/.sparkStaging/application_1574102290151_0001/hive-site.xml
19/11/18 18:40:11 INFO Client: Uploading resource s3a://cody-dev-bi-s3/temp/pyspark_job.py -> hdfs://ip-10-101-30-101.ec2.internal:8020/user/hadoop/.sparkStaging/application_1574102290151_0001/pyspark_job.py
19/11/18 18:40:12 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-101-30-101.ec2.internal:8020/user/hadoop/.sparkStaging/application_1574102290151_0001/pyspark.zip
19/11/18 18:40:12 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://ip-10-101-30-101.ec2.internal:8020/user/hadoop/.sparkStaging/application_1574102290151_0001/py4j-0.10.7-src.zip
19/11/18 18:40:12 INFO Client: Uploading resource file:/mnt/tmp/spark-c251bf55-4c00-485a-8947-617394cc3bb4/__spark_conf__2275605486560105863.zip -> hdfs://ip-10-101-30-101.ec2.internal:8020/user/hadoop/.sparkStaging/application_1574102290151_0001/__spark_conf__.zip
19/11/18 18:40:13 INFO SecurityManager: Changing view acls to: hadoop
19/11/18 18:40:13 INFO SecurityManager: Changing modify acls to: hadoop
19/11/18 18:40:13 INFO SecurityManager: Changing view acls groups to: 
19/11/18 18:40:13 INFO SecurityManager: Changing modify acls groups to: 
19/11/18 18:40:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
19/11/18 18:40:15 INFO Client: Submitting application application_1574102290151_0001 to ResourceManager
19/11/18 18:40:15 INFO YarnClientImpl: Submitted application application_1574102290151_0001
19/11/18 18:40:16 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:16 INFO Client: 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1574102415115
     final status: UNDEFINED
     tracking URL: http://ip-10-101-30-101.ec2.internal:20888/proxy/application_1574102290151_0001/
     user: hadoop
19/11/18 18:40:17 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:18 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:19 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:20 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:21 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:22 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:23 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:24 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:25 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:26 INFO Client: Application report for application_1574102290151_0001 (state: ACCEPTED)
19/11/18 18:40:27 INFO Client: Application report for application_1574102290151_0001 (state: FAILED)
19/11/18 18:40:27 INFO Client: 
     client token: N/A
     diagnostics: Application application_1574102290151_0001 failed 2 times due to AM Container for appattempt_1574102290151_0001_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1574102290151_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-10-101-30-101.ec2.internal:8088/cluster/app/application_1574102290151_0001 Then click on links to logs of each attempt.
. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1574102415115
     final status: FAILED
     tracking URL: http://ip-10-101-30-101.ec2.internal:8088/cluster/app/application_1574102290151_0001
     user: hadoop
19/11/18 18:40:27 ERROR Client: Application diagnostics message: Application application_1574102290151_0001 failed 2 times due to AM Container for appattempt_1574102290151_0001_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1574102290151_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-10-101-30-101.ec2.internal:8088/cluster/app/application_1574102290151_0001 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1574102290151_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1148)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1525)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/11/18 18:40:27 INFO ShutdownHookManager: Shutdown hook called
19/11/18 18:40:27 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-4eb32396-6d6c-43f7-bae3-8c32d7327548
19/11/18 18:40:27 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c251bf55-4c00-485a-8947-617394cc3bb4
Command exiting with ret '1'

I may have misconfigured something in the console, because the script works when I test it locally. I believe this is the screen where I'm doing something wrong:

I seem to have solved my own problem by adding the following configuration under "edit software settings":

[{"configurations":[{"classification":"export","properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"classification":"spark-env","properties":{}}]
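Since this configuration is plain JSON (note the quoting around the `/usr/bin/python3` value), it can be validated locally before pasting it into the console. A minimal sanity-check sketch in Python:

```python
import json

# The spark-env configuration for "edit software settings": an "export"
# block that points PYSPARK_PYTHON at Python 3 on the cluster nodes.
config = (
    '[{"configurations":[{"classification":"export",'
    '"properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],'
    '"classification":"spark-env","properties":{}}]'
)

# json.loads raises ValueError on malformed JSON (e.g. a missing quote),
# so a clean parse confirms the snippet is at least syntactically valid.
parsed = json.loads(config)
export_block = parsed[0]["configurations"][0]
print(export_block["properties"]["PYSPARK_PYTHON"])  # /usr/bin/python3
```

This catches exactly the kind of error in the original snippet (a missing opening quote), which the EMR console may otherwise reject or silently ignore.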

You can look at the log files, which contain the detailed exception, to understand why your code failed. To find the log location, open your cluster in the EMR console -> click the Summary tab -> check the Log URI: value in the Configuration details section. Then go to that Log URI location on S3 and follow this path:

<log_uri_location>/<cluster_id>/containers/application_<some_random_number>

At the location above you will find stdout.gz and stderr.gz; both files can help you find the exact exception.
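Once the container logs are copied down from S3 (for example with `aws s3 sync`), the gzipped stderr files can be scanned for the exception without unpacking them by hand. A small sketch, assuming the logs were synced to a local `./containers` directory (the path is a placeholder):

```python
import glob
import gzip
import os

# Hypothetical local copy of the S3 log prefix, e.g. after running:
#   aws s3 sync s3://<log_uri_location>/<cluster_id>/containers/ ./containers/
log_root = "./containers"

# Walk every stderr.gz under the container directories and print lines
# that mention an exception or error, with the file they came from.
for path in glob.glob(os.path.join(log_root, "**", "stderr.gz"), recursive=True):
    with gzip.open(path, "rt", errors="replace") as fh:
        for line in fh:
            if "Exception" in line or "Error" in line:
                print(f"{path}: {line.rstrip()}")
```

For exit code 13 specifically, the stderr of the failed ApplicationMaster container is usually the file that names the real cause.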