pyspark container - spark-submitting a pyspark script throws a file not found error

Solution:

Add the following environment variables to the container:

export PYSPARK_PYTHON=/usr/bin/python3.9

export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
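To make this permanent instead of exporting the variables in every shell, the same setting can be baked into the image. A minimal sketch, assuming the final stage of the Dockerfile below and that the apt-get python3.9 package places the interpreter at /usr/bin/python3.9:

# Sketch: add to the final stage of the Dockerfile so every container
# starts with PySpark already pointed at the installed interpreter
ENV PYSPARK_PYTHON=/usr/bin/python3.9
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9

If I recall the config names correctly, the same interpreter can also be chosen per job with spark-submit's spark.pyspark.python and spark.pyspark.driver.python settings, but fixing it in the image keeps every submission consistent.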


I am trying to build a Spark container and spark-submit a PySpark script.

I am able to build the container, but running the PySpark script throws the following error:

Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory

Questions:

  1. Any idea why this error occurs?
  2. Do I need to install Python separately, or is it bundled with the Spark download?
  3. Do I need to install PySpark separately, or is it bundled with the Spark download?
  4. What are the pros and cons of each way of installing Python: downloading it and placing it under /opt/python, or using apt-get?

PySpark script (count.py):

from pyspark import SparkContext

sc = SparkContext("local", "count app")

# Build an RDD from a small list of strings and count its elements
words = sc.parallelize([
    "scala",
    "java",
    "hadoop",
    "spark",
    "akka",
    "spark vs hadoop",
    "pyspark",
    "pyspark and spark",
])
counts = words.count()
print("Number of elements in RDD -> %i" % counts)

Output of spark-submit:

newuser@c1f28230da16:~$ spark-submit count.py

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/02/01 19:58:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:319)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:250)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Output of printenv:

newuser@c1f28230da16:~$ printenv

HOME=/home/newuser
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
PYTHONPATH=:/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip
TERM=xterm
SHLVL=1
SPARK_HOME=/opt/spark
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/java/bin:/opt/spark/bin
_=/usr/bin/printenv

myspark Dockerfile:

ARG JDK_PACKAGE=openjdk-14.0.2_linux-x64_bin.tar.gz
ARG SPARK_HOME=/opt/spark
ARG SPARK_PACKAGE=spark-3.0.1-bin-hadoop3.2.tgz


#MAINTAINER demo@gmail.com
#LABEL maintainer="demo@foo.com"


############################################
###  Install openjava
############################################

# Base image stage 1
FROM ubuntu as jdk

ARG JAVA_HOME
ARG JDK_PACKAGE

WORKDIR /opt/

## download open java
#  ADD https://download.java.net/java/GA/jdk14.0.2/205943a0976c4ed48cb16f1043c5c647/12/GPL/$JDK_PACKAGE /
#  ADD $JDK_PACKAGE /
COPY $JDK_PACKAGE .

RUN mkdir -p $JAVA_HOME/ && \
    tar -zxf $JDK_PACKAGE --strip-components 1  -C $JAVA_HOME  && \
    rm -f $JDK_PACKAGE


############################################
###  Install spark search
############################################

# Base image stage 2
FROM ubuntu as spark

#ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

## download spark
COPY $SPARK_PACKAGE .

RUN mkdir -p $SPARK_HOME/  && \
    tar -zxf $SPARK_PACKAGE --strip-components 1  -C $SPARK_HOME  && \
    rm -f $SPARK_PACKAGE

# Mount elasticsearch.yml config
### ADD config/elasticsearch.yml /opt/elasticsearch/config/elasticsearch.yml

############################################
###  final
############################################

FROM ubuntu as finalbuild

ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

# get artifacts from previous stages
COPY --from=jdk $JAVA_HOME $JAVA_HOME
COPY --from=spark $SPARK_HOME $SPARK_HOME

# Setup JAVA_HOME, this is useful for docker commandline
ENV JAVA_HOME $JAVA_HOME
ENV SPARK_HOME $SPARK_HOME

# setup paths
ENV PATH $PATH:$JAVA_HOME/bin
ENV PATH $PATH:$SPARK_HOME/bin
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip




# Expose ports
# EXPOSE 9200
# EXPOSE 9300

# Define mountable directories.
#VOLUME ["/data"]


## give permission to entire setup directory
RUN useradd newuser --create-home --shell /bin/bash  && \
    echo 'newuser:newpassword' | chpasswd && \
    chown -R newuser $SPARK_HOME $JAVA_HOME  && \
    chown -R newuser:newuser /home/newuser && \
    chmod 755 /home/newuser
    #chown -R newuser:newuser /home/newuser
    #chown -R newuser /home/newuser  && \

# Install Python
RUN apt-get update && \
    apt-get install -yq curl  && \
    apt-get install -yq vim  && \
    apt-get install -yq  python3.9



## Install PySpark and Numpy
#RUN \
#    pip install --upgrade pip && \
#    pip install numpy && \
#    pip install pyspark
#

USER newuser

WORKDIR /home/newuser

# RUN  chown -R newuser /home/newuser

Adding the following env variables to the container made it work:

export PYSPARK_PYTHON=/usr/bin/python3.9

export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
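An alternative to overriding PYSPARK_PYTHON is to give the image a plain python executable on the PATH, since the error shows spark-submit trying to launch the literal command "python". A hedged sketch of how the Install Python step in the Dockerfile above could do that; the symlink target assumes Ubuntu's python3.9 package installs the interpreter at /usr/bin/python3.9:

# Install Python and expose it as `python` so Spark's default interpreter lookup resolves
RUN apt-get update && \
    apt-get install -yq curl vim python3.9 && \
    ln -s /usr/bin/python3.9 /usr/local/bin/python

Setting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON explicitly, as in the solution at the top, remains the more direct fix because it leaves no ambiguity about which interpreter the driver uses.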