spark-submit on AWS EMR runs but fails on accessing S3

I wrote a Spark application and compiled it into a .jar file. I can use it just fine from spark-shell --jars myApplication.jar on the master node of my EMR cluster:

scala> // pass in the existing spark context to the doSomething function, run with a particular argument.
scala> com.MyCompany.MyMainClass.doSomething(spark, "dataset1234")
...

Everything works fine like this. I have also set up a main function so that I can submit it with spark-submit:

package com.MyCompany
import org.apache.spark.sql.SparkSession
object MyMainClass {
  val spark = SparkSession.builder()
    .master(("local[*]"))
    .appName("myApp")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    doSomething(spark, args(0))
  }
  
  // implementation of doSomething(...) omitted
}

With a very simple main method that just prints out the args, I confirmed that I can invoke the main method with spark-submit. However, when I try to submit my actual production job on my cluster, it fails. I submit it like this:

spark-submit --deploy-mode cluster --class com.MyCompany.MyMainClass s3://my-bucket/myApplication.jar dataset1234
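
(For reference, the "very simple main method" mentioned above was essentially just the following; this is a sketch of the throwaway test, not the production code:)

package com.MyCompany

object MyMainClass {
  def main(args: Array[String]): Unit = {
    // Just echo the arguments to confirm spark-submit reaches main.
    args.foreach(println)
  }
}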

In the console I see a number of messages, including some warnings, but nothing particularly helpful:

20/11/28 19:28:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/28 19:28:47 WARN DependencyUtils: Skip remote jar s3://my-bucket/myApplication.jar.
20/11/28 19:28:47 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xxx-xxx-xxx.region.compute.internal/172.31.31.156:8032
20/11/28 19:28:47 INFO Client: Requesting a new application from cluster with 20 NodeManagers
20/11/28 19:28:48 INFO Configuration: resource-types.xml not found
20/11/28 19:28:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/11/28 19:28:48 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
20/11/28 19:28:48 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/11/28 19:28:48 INFO Client: Setting up container launch context for our AM
20/11/28 19:28:48 INFO Client: Setting up the launch environment for our AM container
20/11/28 19:28:48 INFO Client: Preparing resources for our AM container
20/11/28 19:28:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/11/28 19:28:51 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_libs__8971082428743972083.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_libs__8971082428743972083.zip
20/11/28 19:28:53 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
20/11/28 19:28:53 INFO Client: Uploading resource s3://my-bucket/myApplication.jar -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/myApplication.jar
20/11/28 19:28:54 INFO S3NativeFileSystem: Opening 's3://my-bucket/myApplication.jar' for reading
20/11/28 19:28:54 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_conf__5385616689365996012.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_conf__.zip
20/11/28 19:28:54 INFO SecurityManager: Changing view acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing view acls groups to:
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls groups to:
20/11/28 19:28:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/11/28 19:28:54 INFO Client: Submitting application application_1606587406989_0005 to ResourceManager
20/11/28 19:28:54 INFO YarnClientImpl: Submitted application application_1606587406989_0005
20/11/28 19:28:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:28:55 INFO Client:
         client token: N/A

Then, once per second for several minutes (6 in this case), I get application reports with state: ACCEPTED, until it fails with nothing useful:

20/11/28 19:28:56 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
...
... (lots of these messages)
...
20/11/28 19:31:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:34:52 INFO Client: Application report for application_1606587406989_0005 (state: FAILED)
20/11/28 19:34:52 INFO Client:
         client token: N/A
         diagnostics: Application application_1606587406989_0005 failed 2 times due to AM Container for appattempt_1606587406989_0005_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: [2020-11-28 19:32:24.087]Exception from container-launch.
Container id: container_1606587406989_0005_02_000001
Exit code: 13

[2020-11-28 19:32:24.117]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
elled)
20/11/28 19:32:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/11/28 19:32:22 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 135, ip-xxx-xxx-xxx-xxx.region.compute.internal, executor driver): TaskKilled (Stage cancelled)

Eventually, the logs show:

org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket/dataset1234.parquet;
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary(DataSource.scala:759)

My application first creates this file, and if the creation fails it silently ignores that and continues (in case the job has been run, worked, and is run again, trying to overwrite the file). The second part reads this file back and does some additional work. So from this error message I know that my application runs and gets through the first part, but apparently Spark is unable to write the file out to S3. It also appears (?) from the second log message that Spark is unable to download the remote jar file from S3. (I did copy the file into ~hadoop/ before running spark-submit, although I don't know whether, after failing to download it from S3, it found the local copy.)
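
For context, doSomething follows roughly this pattern (a hypothetical sketch; the actual implementation is omitted above, and the path and build logic here are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

// inside object MyMainClass
def doSomething(spark: SparkSession, name: String): Unit = {
  val path = s"s3://my-bucket/$name.parquet"

  // Part 1: build the dataset and try to write it out. If the write fails
  // (e.g. the file already exists from an earlier run), silently ignore it.
  try {
    val built: DataFrame = spark.range(100).toDF("id") // placeholder for the real build
    built.write.parquet(path)
  } catch {
    case _: Exception => () // ignore and continue
  }

  // Part 2: read the file back and do additional work. This is where
  // "Path does not exist" surfaces when the write never happened.
  val df = spark.read.parquet(path)
  df.count()
}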

I got my spark-submit command by checking what the EMR AWS CLI export shows for the step I created in the web interface. Is this a problem with EMR somehow not having S3 permissions? That seems unlikely, but then what else could be wrong here? It certainly does run my job, and it seems to successfully discover that the file does not exist (i.e., it has read permission on my bucket), yet it cannot create the file.

How can I get better debugging information? Is there any way to make sure the EMR<-->S3 permissions are correct?

Get rid of this .master(("local[*]")). Your master should not be local when you are running on a cluster and accessing files on S3.
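
In other words, leave the master to spark-submit: on EMR, submitting with --deploy-mode cluster runs the application on YARN, so the master comes from the cluster configuration rather than from the code. A minimal sketch of the adjusted object (the doSomething stub is just a placeholder):

package com.MyCompany

import org.apache.spark.sql.SparkSession

object MyMainClass {
  // No .master(...) here: on EMR, spark-submit and the cluster's spark-defaults
  // supply the master (YARN), so the hard-coded local[*] goes away.
  val spark = SparkSession.builder()
    .appName("myApp")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    doSomething(spark, args(0))
  }

  // implementation of doSomething(...) omitted, as in the question
  def doSomething(spark: SparkSession, dataset: String): Unit = ???
}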