Spark runs in local but can't find file when running in YARN
I have been trying to submit a simple Python script to run on a cluster with YARN. When I execute the job locally there is no problem and everything works fine, but when I run it on the cluster it fails.
I submitted the job with the following command:
spark-submit --master yarn --deploy-mode cluster test.py
The error log I get is the following:
17/11/07 13:02:48 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:49 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:50 INFO yarn.Client: Application report for application_1510046813642_0010 (state: FAILED)
17/11/07 13:02:50 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1510046813642_0010 failed 2 times due to AM Container for appattempt_1510046813642_0010_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver:8088/proxy/application_1510046813642_0010/Then, click on links to logs of each attempt.
**Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py**
java.io.FileNotFoundException: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
at org.apache.hadoop.hdfs.DistributedFileSystem.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem.doCall(DistributedFileSystem.java:1258)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.josholsan
start time: 1510056155796
final status: FAILED
tracking URL: http://myserver:8088/cluster/app/application_1510046813642_0010
user: josholsan
Exception in thread "main" org.apache.spark.SparkException: Application application_1510046813642_0010 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1025)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1072)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/11/07 13:02:50 INFO util.ShutdownHookManager: Shutdown hook called
17/11/07 13:02:50 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5cc8bf5e-216b-4d9e-b66d-9dc01a94e851
I paid particular attention to this line:
Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
I don't know why it can't find test.py. I also tried putting the file in the HDFS home directory of the user executing the job: /user/josholsan/
To complete my post I would also like to share my test.py script:
from pyspark import SparkContext
file="/user/josholsan/concepts_copy.csv"
sc = SparkContext("local","Test app")
textFile = sc.textFile(file).cache()
linesWithOMOP=textFile.filter(lambda line: "OMOP" in line).count()
linesWithICD=textFile.filter(lambda line: "ICD" in line).count()
print("Lines with OMOP: %i, lines with ICD9: %i" % (linesWithOMOP,linesWithICD))
Could the error also be here?:
sc = SparkContext("local","Test app")
Thank you very much in advance for your help.
From the comments section:
sc = SparkContext("local","Test app")
The "local" here will override any command-line settings; from the docs:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
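A minimal sketch of that fix, assuming nothing beyond what the question already shows: drop the hardcoded master from the script and let spark-submit --master yarn supply it (building the context from a SparkConf, as below, is just one common way to do this).
from pyspark import SparkConf, SparkContext

# No master is hardcoded here, so --master yarn --deploy-mode cluster
# passed to spark-submit actually takes effect.
conf = SparkConf().setAppName("Test app")
sc = SparkContext(conf=conf)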
The test.py file must be placed somewhere visible to the whole cluster, e.g.:
spark-submit --master yarn --deploy-mode cluster http://somewhere/accessible/to/master/and/workers/test.py
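A hedged sketch of one way to do that, assuming the script is first uploaded to the HDFS home directory already mentioned in the question (the exact paths are only illustrative):
hdfs dfs -put test.py /user/josholsan/test.py
spark-submit --master yarn --deploy-mode cluster hdfs:///user/josholsan/test.py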
Any additional files and resources can be specified with the --py-files argument (tested on mesos, not on yarn, unfortunately), e.g.:
--py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip
Edit: as @desertnaut commented, this argument should be placed before the script to be executed.
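A hedged example of a full command with that placement, reusing the illustrative archive URL from above:
spark-submit --master yarn --deploy-mode cluster --py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip test.py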
yarn logs -applicationId <app ID>
will give you the output of your submitted job. More here and here.
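For the failed run shown above, that would be, for example:
yarn logs -applicationId application_1510046813642_0010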
Hope this helps, good luck!