pyspark container - spark-submitting a pyspark script throws a file not found error
Solution:
Add the following environment variables to the container:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
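To persist the same fix in the image instead of re-exporting it in every shell, a minimal sketch is to add the equivalent ENV lines to the final stage of the Dockerfile shown further below (same /usr/bin/python3.9 path as above):
ENV PYSPARK_PYTHON=/usr/bin/python3.9
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9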
I am trying to create a Spark container and spark-submit a pyspark script.
I am able to create the container, but running the pyspark script throws the following error:
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
Questions:
- Any idea why this error occurs?
- Do I need to install Python separately, or is it bundled with the Spark download?
- Do I need to install PySpark separately, or is it bundled with the Spark download?
- Which way of installing Python is preferable: downloading it and placing it under /opt/python, or using apt-get?
pyspark script:
from pyspark import SparkContext

# local SparkContext for a simple element count
sc = SparkContext("local", "count app")
words = sc.parallelize(
    ["scala",
     "java",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"]
)
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
Output of spark-submit:
newuser@c1f28230da16:~$ spark-submit count.py
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/02/01 19:58:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
        at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
        at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
        at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
        at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:319)
        at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:250)
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
        ... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
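As a quick test before changing the image, the same variables from the fix above can be supplied inline for a single run (a sketch; it assumes the interpreter really is at /usr/bin/python3.9):
$ PYSPARK_PYTHON=/usr/bin/python3.9 PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9 spark-submit count.py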
Output of printenv:
newuser@c1f28230da16:~$ printenv
HOME=/home/newuser
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
PYTHONPATH=:/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip
TERM=xterm
SHLVL=1
SPARK_HOME=/opt/spark
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/java/bin:/opt/spark/bin
_=/usr/bin/printenv
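Note that neither PYSPARK_PYTHON nor PYSPARK_DRIVER_PYTHON is set and no bare "python" is on the PATH, which matches the "Cannot run program "python"" error. A quick way to see which interpreters the container actually has (standard shell commands, sketched here, not part of the original session):
$ ls /usr/bin/python*
$ command -v python           # prints nothing if no bare "python" exists
$ /usr/bin/python3.9 --version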
myspark dockerfile:
ARG JDK_PACKAGE=openjdk-14.0.2_linux-x64_bin.tar.gz
ARG SPARK_HOME=/opt/spark
ARG SPARK_PACKAGE=spark-3.0.1-bin-hadoop3.2.tgz

#MAINTAINER demo@gmail.com
#LABEL maintainer="demo@foo.com"

############################################
### Install openjava
############################################
# Base image stage 1
FROM ubuntu as jdk

ARG JAVA_HOME
ARG JDK_PACKAGE

WORKDIR /opt/

## download open java
# ADD https://download.java.net/java/GA/jdk14.0.2/205943a0976c4ed48cb16f1043c5c647/12/GPL/$JDK_PACKAGE /
# ADD $JDK_PACKAGE /
COPY $JDK_PACKAGE .

RUN mkdir -p $JAVA_HOME/ && \
    tar -zxf $JDK_PACKAGE --strip-components 1 -C $JAVA_HOME && \
    rm -f $JDK_PACKAGE

############################################
### Install spark search
############################################
# Base image stage 2
FROM ubuntu as spark

#ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

## download spark
COPY $SPARK_PACKAGE .

RUN mkdir -p $SPARK_HOME/ && \
    tar -zxf $SPARK_PACKAGE --strip-components 1 -C $SPARK_HOME && \
    rm -f $SPARK_PACKAGE

# Mount elasticsearch.yml config
### ADD config/elasticsearch.yml /opt/elasticsearch/config/elasticsearch.yml

############################################
### final
############################################
FROM ubuntu as finalbuild

ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

# get artifacts from previous stages
COPY --from=jdk $JAVA_HOME $JAVA_HOME
COPY --from=spark $SPARK_HOME $SPARK_HOME

# Setup JAVA_HOME, this is useful for docker commandline
ENV JAVA_HOME $JAVA_HOME
ENV SPARK_HOME $SPARK_HOME

# setup paths
ENV PATH $PATH:$JAVA_HOME/bin
ENV PATH $PATH:$SPARK_HOME/bin
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip

# Expose ports
# EXPOSE 9200
# EXPOSE 9300

# Define mountable directories.
#VOLUME ["/data"]

## give permission to entire setup directory
RUN useradd newuser --create-home --shell /bin/bash && \
    echo 'newuser:newpassword' | chpasswd && \
    chown -R newuser $SPARK_HOME $JAVA_HOME && \
    chown -R newuser:newuser /home/newuser && \
    chmod 755 /home/newuser
#chown -R newuser:newuser /home/newuser
#chown -R newuser /home/newuser && \

# Install Python
RUN apt-get update && \
    apt-get install -yq curl && \
    apt-get install -yq vim && \
    apt-get install -yq python3.9

## Install PySpark and Numpy
#RUN \
#    pip install --upgrade pip && \
#    pip install numpy && \
#    pip install pyspark
#

USER newuser
WORKDIR /home/newuser

# RUN chown -R newuser /home/newuser
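An alternative at the Dockerfile level (a sketch, not the fix the author used) is to make a bare "python" resolvable, assuming apt placed the interpreter at /usr/bin/python3.9; it would need to go in the final stage before the USER newuser line, since it requires root:
# make "python" point at the installed interpreter (hypothetical addition)
RUN ln -s /usr/bin/python3.9 /usr/bin/python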
Added the following env variables to the container and it worked:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9