How to execute an application uploaded to worker nodes with --files option?
I am uploading a file to my worker nodes with spark-submit and I want to access that file. The file is a binary, which I would like to execute. I already know how to execute a file from Scala, but I keep getting a "File not found" exception and I cannot find a way to access it.
I submit the job with the following command.
spark-submit --class Main --master yarn --deploy-mode cluster --files las2las myjar.jar
While the job is executing, I noticed that the file is uploaded to the staging directory of the currently running application, but when I try to run the following, it does not work.
val command = "hdfs://url/user/username/.sparkStaging/" + sparkContext.applicationId + "/las2las" !!
This is the exception that gets thrown:
17/10/22 18:15:57 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program "hdfs://url/user/username/.sparkStaging/application_1486393309284_26788/las2las": error=2, No such file or directory
So, my question is: how can I access the las2las file?
Use SparkFiles:
val path = org.apache.spark.SparkFiles.get("las2las")
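For example, a minimal sketch (hedged; whether las2las takes arguments and how it is invoked is just an assumption here) that resolves the node-local copy and runs it via scala.sys.process could be:
import org.apache.spark.SparkFiles
import scala.sys.process._

// Resolve the node-local copy of the file shipped with --files
val las2lasPath = SparkFiles.get("las2las")
// Files distributed with --files are not executable by default, so mark it first
s"chmod +x $las2lasPath".!!
// Execute the binary (no arguments assumed; adjust for your tool)
val output = las2lasPath.!!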
How can I access the las2las file?
When you go to the YARN UI at http://localhost:8088/cluster and click the application ID of your Spark application, you are redirected to a page with the container logs. Click Logs. In stderr you should find lines similar to the following:
===============================================================================
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> file:/Users/jacek/.sparkStaging/application_1508700955259_0002
SPARK_USER -> jacek
SPARK_YARN_MODE -> true
command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx1024m \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.worker.ui.port=44444' \
'-Dspark.driver.port=55365' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@192.168.1.6:55365 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
1 \
--app-id \
application_1508700955259_0002 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__spark_libs__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_libs__618005180363157241.zip" } size: 218111116 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_conf__.zip" } size: 105328 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
===============================================================================
I executed my Spark application as follows:
YARN_CONF_DIR=/tmp \
./bin/spark-shell --master yarn --deploy-mode client --files hello.sh
So the line of interest is:
hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
You should find a similar line with the path of your shell script (mine is /Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh).
This file is a binary, which I would like to execute.
With that line, you can try to execute the file:
import scala.sys.process._
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh" !!
warning: there was one feature warning; re-run with -feature for details
java.io.IOException: Cannot run program "/Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:113)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:129)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 50 elided
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 54 more
It won't work by default, since the file is not marked as executable.
$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rw-r--r-- 1 jacek staff 33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
(I don't know whether you can tell Spark or YARN to make a file executable.)
Let's make the file executable.
scala> s"chmod +x /Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
res2: String = ""
It is indeed an executable shell script now.
$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rwxr-xr-x 1 jacek staff 33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
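(As an aside, the permission bits could also be set from Scala without shelling out to chmod, e.g. with Java NIO; this is just a sketch and assumes the staging directory is on a POSIX filesystem.)
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermissions

// Set rwxr-xr-x directly on the staged script (POSIX filesystems only)
Files.setPosixFilePermissions(
  Paths.get(s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh"),
  PosixFilePermissions.fromString("rwxr-xr-x"))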
Let's execute it then.
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
+ echo 'Hello world'
res3: String =
"Hello world
"
It worked fine with the following hello.sh:
#!/bin/sh -x
echo "Hello world"