Installing pandas in Docker Alpine

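I am having a really hard time trying to install a stable data science package configuration in Docker. With such mainstream, relevant tools, this should be easier.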

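The following is the Dockerfile that used to work. It takes pandas out of the core package list and installs it separately, pinned to pandas<0.21.0, because later versions allegedly conflict with numpy.

FROM alpine:3.6

ENV PACKAGES="\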
    dumb-init \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    bash \
    git \
    ca-certificates \
    freetype \
    libgfortran \
    libgcc \
    libstdc++ \
    openblas \
    tcl \
    tk \
    libssl1.0 \
    "

ENV PYTHON_PACKAGES="\
    numpy \
    matplotlib \
    scipy \
    scikit-learn \
    nltk \
    " 

RUN apk add --no-cache --virtual build-dependencies python3 \
    && apk add --virtual build-runtime \
    build-base python3-dev openblas-dev freetype-dev pkgconfig gfortran \
    && ln -s /usr/include/locale.h /usr/include/xlocale.h \
    && python3 -m ensurepip \
    && rm -r /usr/lib/python*/ensurepip \
    && pip3 install --upgrade pip setuptools \
    && ln -sf /usr/bin/python3 /usr/bin/python \
    && ln -sf pip3 /usr/bin/pip \
    && rm -r /root/.cache \
    && pip install --no-cache-dir $PYTHON_PACKAGES \
# pandas is installed separately from the other data science packages
    && pip3 install 'pandas<0.21.0' \
    && apk del build-runtime \
    && apk add --no-cache --virtual build-dependencies $PACKAGES \
    && rm -rf /var/cache/apk/*

# set working directory
WORKDIR /usr/src/app

# add and install requirements (everything other than the data science packages goes here)
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

# add entrypoint.sh
COPY ./entrypoint.sh /usr/src/app/entrypoint.sh

RUN chmod +x /usr/src/app/entrypoint.sh

# add app
COPY . /usr/src/app

# run server
CMD ["/usr/src/app/entrypoint.sh"]

The above configuration used to work. What happens now is that the build does go through, but pandas fails on import with the following error:

ImportError: Missing required dependencies ['numpy']

Since numpy 1.16.1 is installed, I no longer know which numpy pandas is trying to find...
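As far as I can tell, pandas versions of this era list numpy as a missing dependency whenever importing numpy raises an ImportError, so the message can hide a numpy that is installed but broken. A minimal sketch for surfacing the underlying error inside the container follows. The helper name `probe` is made up for illustration.

```python
import importlib
import traceback


def probe(module_name):
    """Try to import a module and return (ok, detail)."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Keep the full traceback: this is the part pandas does not show.
        return False, traceback.format_exc()
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "built-in")
    return True, f"{module_name} {version} at {location}"


if __name__ == "__main__":
    # Import numpy on its own first, then pandas.
    for name in ("numpy", "pandas"):
        ok, detail = probe(name)
        print("OK  " if ok else "FAIL", detail)
```

If numpy itself fails here, the traceback usually points at the real cause (for example a failed build of a compiled extension) rather than at pandas.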

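Does anyone know how to get a stable solution for this?

NOTE: A solution that pulls from a turnkey Docker image for data science, with at least the packages mentioned above, into the Dockerfile above would also be very welcome.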


EDIT 1:

If I move the installation of the data packages into requirements.txt, as suggested in the comments, like so:

requirements.txt

(...)
numpy==1.16.1 # or numpy==1.16.0
scikit-learn==0.20.2
scipy==1.2.1
nltk==3.4   
pandas==0.24.1 # or pandas== 0.23.4
matplotlib==3.0.2 
(...)

Dockerfile:

# add and install requirements
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

it breaks again at pandas, which complains about numpy:

Collecting numpy==1.16.1 (from -r requirements.txt (line 61))
  Downloading https://files.pythonhosted.org/packages/2b/26/07472b0de91851b6656cbc86e2f0d5d3a3128e7580f23295ef58b6862d6c/numpy-1.16.1.zip (5.1MB)
Collecting scikit-learn==0.20.2 (from -r requirements.txt (line 62))
  Downloading https://files.pythonhosted.org/packages/49/0e/8312ac2d7f38537361b943c8cde4b16dadcc9389760bb855323b67bac091/scikit-learn-0.20.2.tar.gz (10.3MB)
Collecting scipy==1.2.1 (from -r requirements.txt (line 63))
  Downloading https://files.pythonhosted.org/packages/a9/b4/5598a706697d1e2929eaf7fe68898ef4bea76e4950b9efbe1ef396b8813a/scipy-1.2.1.tar.gz (23.1MB)
Collecting nltk==3.4 (from -r requirements.txt (line 64))
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
Collecting pandas==0.24.1 (from -r requirements.txt (line 65))
  Downloading https://files.pythonhosted.org/packages/81/fd/b1f17f7dc914047cd1df9d6813b944ee446973baafe8106e4458bfb68884/pandas-0.24.1.tar.gz (11.8MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 359, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 732, in <module>
        ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 475, in maybe_cythonize
        numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1144, in resource_filename
        return get_provider(package_or_requirement).get_resource_filename(
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 361, in get_provider
        __import__(moduleOrReq)
    ModuleNotFoundError: No module named 'numpy'

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-_e5z6o6_/pandas/
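The traceback above shows pandas' setup.py asking for numpy's include directory during `egg_info`, that is, before numpy from the same requirements file has been installed. One way around this is to install numpy in an earlier, separate pip invocation. The sketch below only splits a requirements list into two stages and prints the resulting commands. The function names are hypothetical, and it assumes simple `name==version` lines (no URLs, no `-r` includes).

```python
import re

NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def split_numpy_first(requirement_lines, first=("numpy",)):
    """Split requirement lines into (install_first, install_after)."""
    install_first, install_after = [], []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = NAME_RE.match(line)
        if not match:
            continue  # blank line, or not a plain "name==version" requirement
        target = install_first if match.group(0).lower() in first else install_after
        target.append(line)
    return install_first, install_after


def pip_commands(requirement_lines):
    """Return one pip command per stage as argument lists (nothing is executed)."""
    stages = split_numpy_first(requirement_lines)
    return [["pip", "install", "--no-cache-dir", *stage] for stage in stages if stage]


if __name__ == "__main__":
    sample = [
        "numpy==1.16.1 # or numpy==1.16.0",
        "scipy==1.2.1",
        "pandas==0.24.1 # or pandas== 0.23.4",
    ]
    for command in pip_commands(sample):
        print(" ".join(command))
```

In a Dockerfile, each printed command would go into its own install step, with the numpy one first.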

EDIT 2:

This seems to be an open pandas issue. For more details, please refer to:

pandas-dev github

"Unfortunately, this means that a requirements.txt file is insufficient for setting up a new environment with pandas installed (like in a docker container)".

  **ImportError**:

  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

  Importing the multiarray numpy extension module failed.  Most
  likely you are trying to import a failed build of numpy.
  Here is how to proceed:
  - If you're working with a numpy git repository, try `git clean -xdf`
    (removes all files not under version control) and rebuild numpy.
  - If you are simply trying to use the numpy version that you have installed:
    your installation is broken - please reinstall numpy.
  - If you have already reinstalled and that did not fix the problem, then:
    1. Check that you are using the Python you expect (you're using /usr/local/bin/python),
       and that you have no directories in your PATH or PYTHONPATH that can
       interfere with the Python and numpy versions you're trying to use.
    2. If (1) looks fine, you can open a new issue at
       https://github.com/numpy/numpy/issues.  Please include details on:
       - how you installed Python
       - how you installed numpy
       - your operating system
       - whether or not you have multiple versions of Python installed
       - if you built from source, your compiler versions and ideally a build log

EDIT 3

requirements.txt ---> https://pastebin.com/0icnx0iu


EDIT 4

As of January 12, 2020, the accepted solution stopped working. The build now breaks not at pandas but at scipy, right after numpy, while building scipy's wheel. This is the log:

  ----------------------------------------
  ERROR: Failed building wheel for scipy
  Running setup.py clean for scipy
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s6nahssd/scipy/setup.py'"'"'; __file__='"'"'/tmp/pip-install-s6nahssd/scipy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
       cwd: /tmp/pip-install-s6nahssd/scipy
  Complete output (9 lines):

  `setup.py clean` is not supported, use one of the following instead:

    - `git clean -xdf` (cleans all files)
    - `git clean -Xdf` (cleans all versioned files, doesn't touch
                        files that aren't checked into the git repo)

  Add `--force` to your command to use it anyway if you must (unsupported).

  ----------------------------------------
  ERROR: Failed cleaning build dir for scipy
Successfully built numpy
Failed to build scipy
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly

From the error, it seems the build process is using python3.6, while I am using FROM alpine:3.7.
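Note that the `alpine:3.7` tag names the Alpine release, not the Python version, so a python3.6 path in the log is not necessarily a mismatch. A quick sketch for checking which interpreter is actually running inside the image follows. The helper name is made up for illustration.

```python
import sys


def describe_interpreter():
    """Return a one-line description of the running Python interpreter."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"Python {version} at {sys.executable}"


if __name__ == "__main__":
    print(describe_interpreter())
```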

The full log is here -> https://pastebin.com/Tw4ubxSA

This is the current Dockerfile:

https://pastebin.com/3SftEufx

Try adding this to your requirements.txt file:

numpy==1.16.0
pandas==0.23.4

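I have been getting the same error since yesterday, and this change fixed it for me.

If you are not bound to Alpine 3.6, using Alpine 3.7 (or later) should work.

On Alpine 3.6, installing matplotlib failed for me with the following: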

Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/26/04/8b381d5b166508cc258632b225adbafec49bbe69aa9a4fa1f1b461428313/matplotlib-3.0.3.tar.gz (36.6MB)
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/numpy/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    No local packages or working download links found for numpy>=1.10.0

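However, on Alpine 3.7 it worked. This may be due to a numpy versioning issue (see here), but I can't tell for sure. With that problem out of the way, the packages were built and installed successfully. It took a long time, about 30 minutes, because Alpine's musl libc is incompatible with the prebuilt manylinux wheels, which target glibc, so every package with compiled extensions that pip installs has to be built from source.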
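To illustrate the wheel point: a wheel's filename carries its platform tag, and `manylinux` tags mean the binary was built against glibc. The sketch below parses that tag from a filename, following the PEP 427 naming scheme. The helper names are hypothetical, and the real compatibility check is done by pip itself.

```python
def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: the last three fields are the tags
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    return tuple(parts[-3:])


def needs_glibc(filename):
    """True if every platform tag of the wheel is a manylinux (glibc) tag."""
    platform_tag = wheel_tags(filename)[2]
    return all(tag.startswith("manylinux") for tag in platform_tag.split("."))


if __name__ == "__main__":
    for name in (
        "numpy-1.16.2-cp37-cp37m-manylinux1_x86_64.whl",
        "six-1.12.0-py2.py3-none-any.whl",
    ):
        print(name, "-> glibc only" if needs_glibc(name) else "-> not tied to glibc")
```

A pure-Python wheel tagged `any` installs fine on Alpine. It is the compiled packages (numpy, scipy, pandas and so on) that fall back to a source build.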

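Note that one important change is required: you should delete the build-runtime virtual package (apk del build-runtime) only after the pip install. Also, if applicable, you could replace numpy 1.16.1 with 1.16.2, which is the shipped version (otherwise 1.16.2 will be uninstalled and 1.16.1 built from source, further increasing the build time). I haven't tried this, though.

For reference, here is my slightly modified Dockerfile and docker build output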

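Note:

Alpine is usually chosen as a base to minimize image size (Alpine is also very slick otherwise, but has compatibility issues with mainstream Linux applications because of glibc/musl). Having to build Python packages from source rather defeats that purpose, since you end up with a very bloated image (900MB before any cleanup) that also takes ages to build. The image could be compacted considerably by removing all intermediate compilation artifacts, build dependencies and so on, but still.

If you can't get the Python package versions you need working on Alpine without building them from source, I would suggest trying other small and more compatible base images, such as debian-slim, or even ubuntu.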

Edit:

After the requirements were added in "Edit 3", here are the updated Dockerfile and Docker build output. The following packages were added to satisfy the build dependencies:

postgresql-dev libffi-dev libressl-dev libxml2 libxml2-dev libxslt libxslt-dev libjpeg-turbo-dev zlib-dev

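For packages whose builds failed because of specific missing headers, I used Alpine's package contents search to locate the missing package. Specifically for cffi, the ffi.h header was missing, which requires the libffi-dev package: https://pkgs.alpinelinux.org/contents?file=ffi.h&path=&name=&branch=v3.7.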
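The same lookup can be repeated for any missing header by changing the `file` parameter. A small sketch that builds such a URL follows. The function name is hypothetical, and the query parameters are simply those of the URL quoted above.

```python
from urllib.parse import urlencode

# Base of the contents search, copied from the URL quoted above.
SEARCH_BASE = "https://pkgs.alpinelinux.org/contents"


def contents_search_url(filename, branch="v3.7"):
    """Build a package contents search URL for a missing file such as a header."""
    query = {"file": filename, "path": "", "name": "", "branch": branch}
    return f"{SEARCH_BASE}?{urlencode(query)}"


if __name__ == "__main__":
    print(contents_search_url("ffi.h"))
```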

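Alternatively, when a package build failure is not very clear, you can consult the installation instructions of the specific package, for example Pillow.

The new image size, before squashing, is 1.04GB. To trim it a little, you can remove the Python and pip caches: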

RUN apk del build-runtime && \
    find -type d -name __pycache__ -prune -exec rm -rf {} \; && \
    rm -rf ~/.cache/pip

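When using docker build --squash, this reduces the image size to 661MB.

This may not be entirely relevant, but since this is the first answer that pops up when searching for numpy/pandas installation failures on Alpine, I am adding this answer.

The following fix worked for me (although installing pandas/numpy takes longer):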

apk update
apk --no-cache add curl gcc g++
ln -s /usr/include/locale.h /usr/include/xlocale.h

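Now try installing pandas/numpy.

An earlier Q&A touches on this.

If your goal is to get a stable solution without understanding the specifics, for Python 3 you can build the following (copied from another answer of mine):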

FROM python:3.7-alpine
RUN echo "@testing http://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
RUN apk add --update --no-cache py3-numpy py3-pandas@testing

If your goal is to understand how to achieve a stable build, the discussion and the related images there may also help...

FROM python:3.8-alpine

RUN apk --update add gcc build-base freetype-dev libpng-dev openblas-dev

RUN pip install --no-cache-dir matplotlib pandas