Spark Submit failure on EMR - _"java.lang.IllegalStateException: ... make sure Spark is built."_
I am unable to submit a Spark job via spark-submit on EMR. My spark-submit looks like this -
sudo spark-submit --class timeusage.TimeUsage \
--deploy-mode cluster --master yarn \
--num-executors 2 --conf spark.executor.cores=2 \
--conf spark.executor.memory=2g --conf spark.driver.memory=1g \
--conf spark.driver.cores=1 --conf spark.logConf=true \
--conf spark.yarn.appMasterEnv.SPARKMASTER=yarn \
--conf spark.yarn.appMasterEnv.WAREHOUSEDIR=s3a://whbucket/spark-warehouse \
--conf spark.yarn.appMasterEnv.S3AACCESSKEY=xxx \
--conf spark.yarn.appMasterEnv.S3ASECRETKEY=yyy \
--jars s3://bucket/week3-assembly-0.1.0-SNAPSHOT.jar \
s3:/bucket/week3-assembly-0.1.0-SNAPSHOT.jar \
s3a://sbucket/atussum.csv
The error looks like this -
19/06/04 07:36:59 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, ip-172-31-66-110.ec2.internal, executor 1): java.lang.ExceptionInInitializerError
at timeusage.TimeUsage$$anonfun.apply(TimeUsage.scala:70)
at timeusage.TimeUsage$$anonfun.apply(TimeUsage.scala:70)
at scala.collection.Iterator$$anon.next(Iterator.scala:410)
at scala.collection.Iterator$$anon.next(Iterator.scala:410)
at scala.collection.Iterator$$anon.next(Iterator.scala:410)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$anon.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/root/appcache/application_1559614942233_0036/container_1559614942233_0036_02_000002/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:342)
at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:543)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:863)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:178)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun.apply(SparkSession.scala:936)
at org.apache.spark.sql.SparkSession$Builder$$anonfun.apply(SparkSession.scala:927)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:927)
at timeusage.TimeUsage$.<init>(TimeUsage.scala:23)
at timeusage.TimeUsage$.<clinit>(TimeUsage.scala)
... 23 more
I have verified that my project's build dependencies are all correct. The project works fine with local[*].
This is my first time working with a multi-module SBT project, so I am not sure whether that has anything to do with it.
I have added the assembly JAR to be executed to the --jars configuration, but it has no effect at all.
My build.sbt is here - https://github.com/kevvo83/scala-spark-ln/blob/master/build.sbt
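For orientation only, a generic multi-module build with sbt-assembly usually looks something like the sketch below; this is an assumed typical shape, not the contents of the linked build.sbt. Marking the Spark dependencies as Provided matters, since Spark is already on the EMR cluster and should not be bundled into the assembly.

// Generic sbt-assembly sketch (hypothetical, NOT the linked build.sbt)
// Requires the sbt-assembly plugin in project/plugins.sbt
lazy val week3 = (project in file("week3"))
  .settings(
    name := "week3",
    scalaVersion := "2.11.12",
    // Spark is provided by the EMR cluster at runtime, so keep it out of the fat JAR
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % Provided,
    assembly / mainClass := Some("timeusage.TimeUsage")
  )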
The expected result is that the project runs to completion and creates Hive tables in S3.
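For context, here is a minimal sketch of how a session could be built from those appMasterEnv variables; the variable names come from the confs in the command above, but the code itself is an assumption, not the actual TimeUsage.scala:

import org.apache.spark.sql.SparkSession

// Hypothetical sketch: read the env vars exported via spark.yarn.appMasterEnv.* and build the session
val spark = SparkSession
  .builder()
  .appName("TimeUsage")
  .master(sys.env.getOrElse("SPARKMASTER", "local[*]"))
  .config("spark.sql.warehouse.dir", sys.env.getOrElse("WAREHOUSEDIR", "spark-warehouse"))
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("S3AACCESSKEY", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("S3ASECRETKEY", ""))
  .enableHiveSupport() // needed for the job to create Hive tables
  .getOrCreate()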
I am still investigating and will post updates here once I have them.
Following Harsh's answer, I added these two lines to my spark-submit command -
--files /usr/lib/spark/conf/hive-site.xml \
--jars s3://bucket/week3-assembly-0.1.0-SNAPSHOT.jar \
The stack trace error is now -
19/06/06 10:37:55 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, ip-172-31-76-146.ec2.internal, executor 1): java.lang.NoClassDefFoundError: Could not initialize class timeusage.TimeUsage$
at timeusage.TimeUsage$$anonfun.apply(TimeUsage.scala:70)
at timeusage.TimeUsage$$anonfun.apply(TimeUsage.scala:70)
at scala.collection.Iterator$$anon.next(Iterator.scala:410)
at scala.collection.Iterator$$anon.next(Iterator.scala:410)
at scala.collection.Iterator$$anon.next(Iterator.scala:410)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$anon.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
(FYI, timeusage.TimeUsage is my class in the JAR.) Is there anything else I need to add to make sure my class definitions are picked up?
UPDATE: I have got this working. I believe the last 3 confs in the snippet below are what did it (based on what the docs say about how Spark uploads JARs to a staging area on HDFS for the executors to access).
--conf spark.executorEnv.SPARK_HOME=/usr/lib/spark/
--conf spark.yarn.jars=/usr/lib/spark/jars/*.jar
--conf spark.network.timeout=600000
--files /usr/lib/spark/conf/spark-defaults.conf
Also, spark-submit now runs the JAR from local disk, rather than from the S3 bucket as I was incorrectly doing before.
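Putting the pieces together, the final spark-submit ended up looking roughly like the sketch below. The local JAR path is illustrative (substitute wherever the assembly was copied on the master node), and the spark.yarn.jars glob is quoted so the shell does not expand it:

sudo spark-submit --class timeusage.TimeUsage \
--deploy-mode cluster --master yarn \
--num-executors 2 --conf spark.executor.cores=2 \
--conf spark.executor.memory=2g --conf spark.driver.memory=1g \
--conf spark.driver.cores=1 --conf spark.logConf=true \
--conf spark.yarn.appMasterEnv.SPARKMASTER=yarn \
--conf spark.yarn.appMasterEnv.WAREHOUSEDIR=s3a://whbucket/spark-warehouse \
--conf spark.yarn.appMasterEnv.S3AACCESSKEY=xxx \
--conf spark.yarn.appMasterEnv.S3ASECRETKEY=yyy \
--conf spark.executorEnv.SPARK_HOME=/usr/lib/spark/ \
--conf 'spark.yarn.jars=/usr/lib/spark/jars/*.jar' \
--conf spark.network.timeout=600000 \
--files /usr/lib/spark/conf/hive-site.xml,/usr/lib/spark/conf/spark-defaults.conf \
/home/hadoop/week3-assembly-0.1.0-SNAPSHOT.jar \
s3a://sbucket/atussum.csv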
Marking the answer as correct since it put me on the right track to solving the problem.
Since you are running spark-submit with yarn as the master, you need to pass hive-site.xml as a --files argument in the command:
sudo spark-submit --class timeusage.TimeUsage \
--deploy-mode cluster --master yarn \
--num-executors 2 --conf spark.executor.cores=2 \
--conf spark.executor.memory=2g --conf spark.driver.memory=1g \
--conf spark.driver.cores=1 --conf spark.logConf=true \
--conf spark.yarn.appMasterEnv.SPARKMASTER=yarn \
--conf spark.yarn.appMasterEnv.WAREHOUSEDIR=s3a://whbucket/spark-warehouse \
--conf spark.yarn.appMasterEnv.S3AACCESSKEY=xxx \
--conf spark.yarn.appMasterEnv.S3ASECRETKEY=yyy \
--jars s3://bucket/week3-assembly-0.1.0-SNAPSHOT.jar \
--files /usr/lib/spark/conf/hive-site.xml \
s3:/bucket/week3-assembly-0.1.0-SNAPSHOT.jar \
s3a://sbucket/atussum.csv
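Worth noting why --files is needed here: in cluster deploy mode the driver runs inside a YARN container on one of the cluster nodes, not in the shell where spark-submit is invoked, so it only sees configuration files that are shipped to that container (or already present on that node). One way to confirm hive-site.xml actually reached the containers is to search the aggregated application logs (assuming YARN log aggregation is enabled on the cluster):

yarn logs -applicationId <application_id> | grep -i hive-site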