Cannot import pyspark from a pipenv virtualenv as it cannot find py4j
I built a docker image containing spark and pipenv. If I run python inside the pipenv virtualenv and try to import pyspark, it fails with "ModuleNotFoundError: No module named 'py4j'":
root@4d0ae585a52a:/tmp# pipenv run python -c "from pyspark.sql import SparkSession"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/spark/python/pyspark/__init__.py", line 46, in <module>
from pyspark.context import SparkContext
File "/opt/spark/python/pyspark/context.py", line 29, in <module>
from py4j.protocol import Py4JError
ModuleNotFoundError: No module named 'py4j'
However, if I run pyspark inside that same virtualenv there is no such problem:
root@4d0ae585a52a:/tmp# pipenv run pyspark
Python 3.7.4 (default, Sep 12 2019, 16:02:06)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/10/16 10:18:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/16 10:18:33 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
Using Python version 3.7.4 (default, Sep 12 2019 16:02:06)
SparkSession available as 'spark'.
>>> spark.createDataFrame([('Alice',)], ['name']).collect()
[Row(name='Alice')]
I'll admit I copied a lot of the code for my Dockerfile from elsewhere, so I don't fully understand how it all hangs together under the covers. I had hoped that having py4j on the PYTHONPATH would be enough, but apparently not. I can confirm that it is on the PYTHONPATH and that it exists:
root@4d0ae585a52a:/tmp# pipenv run python -c "import os;print(os.environ['PYTHONPATH'])"
/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip:
root@4d0ae585a52a:/tmp# pipenv run ls /opt/spark/python/lib/py4j*
/opt/spark/python/lib/py4j-0.10.4-src.zip
Can anyone suggest what I can do to get the python interpreter in my virtualenv to pick up py4j?
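For what it's worth, here is a small diagnostic I would run via pipenv run python to narrow this down (just a sketch written for this question, not something from the image itself): it prints whether each PYTHONPATH entry actually exists on disk and where, if anywhere, py4j currently resolves from.
import importlib.util
import os

# List each PYTHONPATH entry and whether it actually exists on disk.
for entry in os.environ.get("PYTHONPATH", "").split(os.pathsep):
    if entry:
        print(entry, "->", "exists" if os.path.exists(entry) else "MISSING")

# None means the interpreter cannot locate py4j at all.
print("py4j spec:", importlib.util.find_spec("py4j"))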
Here is the Dockerfile. We pull artifacts (docker images, apt packages, pypi packages etc.) from a local jfrog artifactory cache, hence all the artifactory references in here:
FROM images.artifactory.our.org.com/python3-7-pipenv:1.0
WORKDIR /tmp
ENV SPARK_VERSION=2.2.1
ENV HADOOP_VERSION=2.8.4
ARG ARTIFACTORY_USER
ARG ARTIFACTORY_ENCRYPTED_PASSWORD
ARG ARTIFACTORY_PATH=artifactory.our.org.com/artifactory/generic-dev/ceng/external-dependencies
ARG SPARK_BINARY_PATH=https://${ARTIFACTORY_PATH}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz
ARG HADOOP_BINARY_PATH=https://${ARTIFACTORY_PATH}/hadoop-${HADOOP_VERSION}.tar.gz
ADD apt-transport-https_1.4.8_amd64.deb /tmp
RUN echo "deb https://username:password@artifactory.our.org.com/artifactory/debian-main-remote stretch main" >/etc/apt/sources.list.d/main.list &&\
echo "deb https://username:password@artifactory.our.org.com/artifactory/maria-db-debian stretch main" >>/etc/apt/sources.list.d/main.list &&\
echo 'Acquire::CompressionTypes::Order:: "gz";' > /etc/apt/apt.conf.d/02update &&\
echo 'Acquire::http::Timeout "10";' > /etc/apt/apt.conf.d/99timeout &&\
echo 'Acquire::ftp::Timeout "10";' >> /etc/apt/apt.conf.d/99timeout &&\
dpkg -i /tmp/apt-transport-https_1.4.8_amd64.deb &&\
apt-get install --allow-unauthenticated -y /tmp/apt-transport-https_1.4.8_amd64.deb &&\
apt-get update --allow-unauthenticated -y -o Dir::Etc::sourcelist="sources.list.d/main.list" -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
RUN apt-get update && \
apt-get -y install default-jdk
# Detect JAVA_HOME and export in bashrc.
# This will result in something like this being added to /etc/bash.bashrc
# export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo export JAVA_HOME="$(readlink -f /usr/bin/java | sed "s:/jre/bin/java::")" >> /etc/bash.bashrc
# Configure Spark-${SPARK_VERSION}
# Not using tar -v because including verbose output causes ci logs to exceed max length
RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${SPARK_BINARY_PATH}" -o /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
&& cd /opt \
&& tar -xzf /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
&& rm spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
&& ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark \
&& sed -i '/log4j.rootCategory=INFO, console/c\log4j.rootCategory=CRITICAL, console' /opt/spark/conf/log4j.properties.template \
&& mv /opt/spark/conf/log4j.properties.template /opt/spark/conf/log4j.properties \
&& mkdir /opt/spark-optional-jars/ \
&& mv /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf \
&& printf "spark.driver.extraClassPath /opt/spark-optional-jars/*\nspark.executor.extraClassPath /opt/spark-optional-jars/*\n">>/opt/spark/conf/spark-defaults.conf \
&& printf "spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby" >> /opt/spark/conf/spark-defaults.conf
# Configure Hadoop-${HADOOP_VERSION}
# Not using tar -v because including verbose output causes ci logs to exceed max length
RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${HADOOP_BINARY_PATH}" -o /opt/hadoop-${HADOOP_VERSION}.tar.gz \
&& cd /opt \
&& tar -xzf /opt/hadoop-${HADOOP_VERSION}.tar.gz \
&& rm /opt/hadoop-${HADOOP_VERSION}.tar.gz \
&& ln -s hadoop-${HADOOP_VERSION} hadoop
# Set Environment Variables.
ENV SPARK_HOME="/opt/spark" \
HADOOP_HOME="/opt/hadoop" \
PYSPARK_SUBMIT_ARGS="--master=lo cal[*] pyspark-shell --executor-memory 1g --driver-memory 1g --conf spark.ui.enabled=false spark.executor.extrajavaoptions=-Xmx=1024m" \
PYTHONPATH="/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH" \
PATH="$PATH:/opt/spark/bin:/opt/hadoop/bin" \
PYSPARK_DRIVER_PYTHON="/usr/local/bin/python" \
PYSPARK_PYTHON="/usr/local/bin/python"
# Upgrade pip and setuptools
RUN pip install --index-url https://username:password@artifactory.our.org.com/artifactory/api/pypi/pypi-virtual-all/simple --upgrade pip setuptools
I think I have solved this by installing py4j standalone:
$ docker run --rm -it images.artifactory.our.org.com/myimage:mytag bash
root@1d6a0ec725f0:/tmp# pipenv install py4j
Installing py4j…
✔ Installation Succeeded
Pipfile.lock (49f1d8) out of date, updating to (dfdbd6)…
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
✔ Success!
Updated Pipfile.lock (49f1d8)!
Installing dependencies from Pipfile.lock (49f1d8)…
▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 42/42 — 00:00:06
To activate this projects virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.
root@1d6a0ec725f0:/tmp# pipenv run python -c "from pyspark.sql import SparkSession;spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate();print(spark.createDataFrame([('Alice',)], ['name']).collect())"
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/10/16 13:05:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/16 13:05:48 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
[Row(name='Alice')]
root@1d6a0ec725f0:/tmp#
I'm not entirely sure why I had to do that given that py4j is already on the PYTHONPATH, but it seems fine so far, so I'm happy. If anyone can shed light on why it doesn't work without explicitly installing py4j, I'd love to know. I can only assume that this line in my Dockerfile:
PYTHONPATH="/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
isn't successfully making the interpreter aware of py4j.
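Python can normally import straight out of a zip archive that is on sys.path, so the zip mechanism itself shouldn't be the obstacle. Here is a tiny sketch to illustrate; note that it uses the py4j-0.10.4-src.zip filename that the ls output above shows is actually on disk, not the 0.10.7 name from the PYTHONPATH setting:
import sys

# Add the zip that actually exists (per the ls output above) to sys.path
# and confirm that py4j can be imported from inside it.
sys.path.insert(0, "/opt/spark/python/lib/py4j-0.10.4-src.zip")

import py4j
print(py4j.__file__)  # should point inside the zip archive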
Just to confirm (in case it helps), here is where pip thinks py4j and pyspark are installed:
root@1d6a0ec725f0:/tmp# pipenv run pip show pyspark
Name: pyspark
Version: 2.2.1
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /opt/spark-2.2.1-bin-hadoop2.7/python
Requires: py4j
Required-by:
root@1d6a0ec725f0:/tmp# pipenv run pip show py4j
Name: py4j
Version: 0.10.8.1
Summary: Enables Python programs to dynamically access arbitrary Java objects
Home-page: https://www.py4j.org/
Author: Barthelemy Dagenais
Author-email: barthelemy@infobart.com
License: BSD License
Location: /root/.local/share/virtualenvs/tmp-XVr6zr33/lib/python3.7/site-packages
Requires:
Required-by: pyspark
root@1d6a0ec725f0:/tmp#
Another solution: unzip the py4j zip file as part of the Dockerfile stage that installs spark, then set PYTHONPATH accordingly:
unzip spark/python/lib/py4j-*-src.zip -d spark/python/lib/
...
...
PYTHONPATH="/opt/spark/python:/opt/spark/python/lib:$PYTHONPATH"
This is actually the best solution. Here is the new Dockerfile:
FROM images.artifactory.our.org.com/python3-7-pipenv:1.0
WORKDIR /tmp
ENV SPARK_VERSION=2.2.1
ENV HADOOP_VERSION=2.8.4
ARG ARTIFACTORY_USER
ARG ARTIFACTORY_ENCRYPTED_PASSWORD
ARG ARTIFACTORY_PATH=artifactory.our.org.com/artifactory/generic-dev/ceng/external-dependencies
ARG SPARK_BINARY_PATH=https://${ARTIFACTORY_PATH}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz
ARG HADOOP_BINARY_PATH=https://${ARTIFACTORY_PATH}/hadoop-${HADOOP_VERSION}.tar.gz
ADD apt-transport-https_1.4.8_amd64.deb /tmp
RUN echo "deb https://username:password@artifactory.our.org.com/artifactory/debian-main-remote stretch main" >/etc/apt/sources.list.d/main.list &&\
echo "deb https://username:password@artifactory.our.org.com/artifactory/maria-db-debian stretch main" >>/etc/apt/sources.list.d/main.list &&\
echo 'Acquire::CompressionTypes::Order:: "gz";' > /etc/apt/apt.conf.d/02update &&\
echo 'Acquire::http::Timeout "10";' > /etc/apt/apt.conf.d/99timeout &&\
echo 'Acquire::ftp::Timeout "10";' >> /etc/apt/apt.conf.d/99timeout &&\
dpkg -i /tmp/apt-transport-https_1.4.8_amd64.deb &&\
apt-get install --allow-unauthenticated -y /tmp/apt-transport-https_1.4.8_amd64.deb &&\
apt-get update --allow-unauthenticated -y -o Dir::Etc::sourcelist="sources.list.d/main.list" -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
RUN apt-get update && \
apt-get -y install default-jdk
# Detect JAVA_HOME and export in bashrc.
# This will result in something like this being added to /etc/bash.bashrc
# export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo export JAVA_HOME="$(readlink -f /usr/bin/java | sed "s:/jre/bin/java::")" >> /etc/bash.bashrc
# Configure Spark-${SPARK_VERSION}
# Not using tar -v because including verbose output causes ci logs to exceed max length
RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${SPARK_BINARY_PATH}" -o /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
&& cd /opt \
&& tar -xzf /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
&& rm spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
&& ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark \
&& unzip spark/python/lib/py4j-*-src.zip -d spark/python/lib/ \
&& sed -i '/log4j.rootCategory=INFO, console/c\log4j.rootCategory=CRITICAL, console' /opt/spark/conf/log4j.properties.template \
&& mv /opt/spark/conf/log4j.properties.template /opt/spark/conf/log4j.properties \
&& mkdir /opt/spark-optional-jars/ \
&& mv /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf \
&& printf "spark.driver.extraClassPath /opt/spark-optional-jars/*\nspark.executor.extraClassPath /opt/spark-optional-jars/*\n">>/opt/spark/conf/spark-defaults.conf \
&& printf "spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby" >> /opt/spark/conf/spark-defaults.conf
# Configure Hadoop-${HADOOP_VERSION}
# Not using tar -v because including verbose output causes ci logs to exceed max length
RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${HADOOP_BINARY_PATH}" -o /opt/hadoop-${HADOOP_VERSION}.tar.gz \
&& cd /opt \
&& tar -xzf /opt/hadoop-${HADOOP_VERSION}.tar.gz \
&& rm /opt/hadoop-${HADOOP_VERSION}.tar.gz \
&& ln -s hadoop-${HADOOP_VERSION} hadoop
# Set Environment Variables.
ENV SPARK_HOME="/opt/spark" \
HADOOP_HOME="/opt/hadoop" \
PYSPARK_SUBMIT_ARGS="--master=local[*] pyspark-shell --executor-memory 1g --driver-memory 1g --conf spark.ui.enabled=false spark.executor.extrajavaoptions=-Xmx=1024m" \
PYTHONPATH="/opt/spark/python:/opt/spark/python/lib:$PYTHONPATH" \
PATH="$PATH:/opt/spark/bin:/opt/hadoop/bin" \
PYSPARK_DRIVER_PYTHON="/usr/local/bin/python" \
PYSPARK_PYTHON="/usr/local/bin/python"
# Upgrade pip and setuptools
RUN pip install --index-url https://username:password@artifactory.our.org.com/artifactory/api/pypi/pypi-virtual-all/simple --upgrade pip setuptools
So apparently I can't just put that zip file on the PYTHONPATH and have the python interpreter use its contents. As I said above, I copied that code from elsewhere, so why it works for others and not for me I don't know. Comparing the output above, though, the PYTHONPATH entry refers to py4j-0.10.7-src.zip while the file actually shipped with this Spark distribution is py4j-0.10.4-src.zip, so the mismatched filename may well be the real reason. Oh well, it all works now.
Here is a handy command to check that it's all working:
docker run --rm -it myimage:mytag pipenv run python -c "from pyspark.sql import SparkSession;spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate();print(spark.createDataFrame([('Alice',)], ['name']).collect())"
Here is the output of running that command:
$ docker run --rm -it myimage:mytag pipenv run python -c "from pyspark.sql import SparkSession;spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate();print(spark.createDataFrame([('Alice',)], ['name']).collect())"
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/10/16 15:53:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/16 15:53:55 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
19/10/16 15:53:55 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
19/10/16 15:53:56 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
[Row(name='Alice')]