Is it possible to use a docker image that has both pyspark and pandas installed?

My Flask application uses both pandas and pyspark.

I created a Dockerfile based on the pandas Docker image:

FROM amancevice/pandas
RUN mkdir /app
ADD . /app
WORKDIR /app
EXPOSE 5000
RUN pip install -r requirements.txt
CMD ["python", "app.py"]

In requirements.txt I have:

flask
pymysql
sqlalchemy
passlib
hdfs
Werkzeug
pandas
pyspark

Here is where I use pyspark, in this function (it is just an example to verify that it works):

from pyspark.sql import SparkSession

@app.route('/home/search', methods=["GET", "POST"])
def search_tab():
    if 'loggedin' in session:
        user_id = 'user' + str(session['id'])

        if request.method == 'POST':
            checkboxData = request.form.getlist("checkboxData")

            for cd in checkboxData:
                if cd.endswith(".csv"):
                    data_hdfs(user_id, cd)
                else:
                    print("xml")

            return render_template("search.html", id=session['id'])
    return render_template('login.html')


def data_hdfs(user_id, cd):
    #spark session
    warehouse_location ='hdfs://hdfs-nn:9000/flask_platform'

    spark = SparkSession \
        .builder \
        .master("local[2]") \
        .appName("read csv") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .getOrCreate()

    raw_data = spark.read.options(header='True', delimiter=';').csv("hdfs://hdfs-nn:9000"+cd)

    raw_data.repartition(1).write.format('csv').option('header',True).mode('overwrite').option('sep',';').save("hdfs://hdfs-nn:9000/flask_platform/"+user_id+"/staging_area/mapped_files/mapped_file_4.csv")

    return spark.stop()

But when I try to run the pyspark code, I get this error:

JAVA_HOME is not set
172.20.0.1 - - [15/Apr/2022 11:58:16] "POST /home/search HTTP/1.1" 500 -
Traceback (most recent call last):
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2095, in __call__
     return self.wsgi_app(environ, start_response)
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2080, in wsgi_app
     response = self.handle_exception(e)
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2077, in wsgi_app
     response = self.full_dispatch_request()
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1525, in full_dis
     rv = self.handle_user_exception(e)
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1523, in full_dis
     rv = self.dispatch_request()
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1509, in dispatch
     return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
   File "/app/app.py", line 243, in search_tab
     data_hdfs(user_id, cd)
   File "/app/app.py", line 255, in data_hdfs
     spark = SparkSession \
   File "/usr/local/lib/python3.9/site-packages/pyspark/sql/session.py", line 228, in
     sc = SparkContext.getOrCreate(sparkConf)
   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 392, in get
     SparkContext(conf=conf or SparkConf())
   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 144, in __i
     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 339, in _en
     SparkContext._gateway = gateway or launch_gateway(conf)
   File "/usr/local/lib/python3.9/site-packages/pyspark/java_gateway.py", line 108, i
     raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number

Is it possible to use a Docker image that has both pyspark and pandas installed? If so, where can I find it? I need to use both in my project. Thanks.

pyspark (i.e. Apache Spark) requires Java, which does not appear to be installed in your image.
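You can confirm this diagnosis from inside the container before rebuilding. Below is a minimal sketch (the helper name `java_available` is mine, not part of your app) that looks for a usable Java runtime roughly the way Spark's launcher does: first via `JAVA_HOME`, then by searching `PATH`:

```python
import os
import shutil

def java_available():
    """Return True if a Java runtime can be found via JAVA_HOME or on PATH."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home and os.path.isfile(os.path.join(java_home, "bin", "java")):
        return True
    # Fall back to searching PATH, as Spark's launch scripts do
    return shutil.which("java") is not None
```

Running this in your current image should return False; after installing a JRE as shown below it should return True.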

You can try something like this:

FROM amancevice/pandas

RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         openjdk-11-jre-headless \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]

Note that I also moved the requirements.txt install so it runs before your application code is added. As long as requirements.txt is unchanged, Docker's layer cache will reuse the installed dependencies and save you time on rebuilds.