Multi-stage Docker build: stat reports that an NVIDIA file does not exist, but it does

I am trying to merge two Docker images.

This is my Dockerfile:

FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop

COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda

COPY --from=cuda10 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/

The build fails:

$ docker build . -t nvidia-ros:osrf
Step 5/7 : COPY --from=cuda10 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 /usr/lib/x86_64-linux-gnu/
COPY failed: stat usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03: file does not exist

But the files do exist:

$ docker run -it --rm --gpus all nvidia/cuda:10.0-devel-ubuntu18.04
root@fc9c1d8ccdc2:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root       37 Jan 30 14:13 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.460.32.03
-rw-r--r-- 1 root root 12129448 Aug 20  2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rw-r--r-- 1 root root 10516984 Dec 27 18:55 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03

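TL;DR: The file is installed by the NVIDIA runtime (docs) when the container starts, so it is not present at build time. You need several environment variables set, either in the image or at container startup, so that the NVIDIA runtime installs the driver libraries into the container. See the Dockerfile at the end for an example.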

To investigate, I first ran this command:

docker run --rm --entrypoint="" -it nvidia/cuda:10.0-devel-ubuntu18.04 \
stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03

and got the same error:

stat: cannot stat '/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03': No such file or directory

So I went into the directory and looked around with ls:

root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libnvidia-ptxjitcompiler.so
ls: cannot access 'libnvidia-ptxjitcompiler.so': No such file or directory

root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libn
libnccl.so         libnccl_static.a   libnpth.so.0       libnsl.so          libnss_files.so    libnss_nisplus.so  
libnccl.so.2       libnettle.so.6     libnpth.so.0.1.1   libnss_compat.so   libnss_hesiod.so   
libnccl.so.2.6.4   libnettle.so.6.4   libnsl.a           libnss_dns.so      libnss_nis.so      

The files are missing.

Then I used the command you shared:

docker run -it --rm --runtime nvidia nvidia/cuda:10.0-devel-ubuntu18.04

root@4a1602f3d5c0:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root       34 Jan 30 14:48 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.450.66
-rw-r--r-- 1 root root 12129448 Aug 20  2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rwxr-xr-x 1 root root  9947144 Sep 28 10:57 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66

The file is there, but with a different version, one that matches my NVIDIA driver version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
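
The version comparison above can also be done in a script. This is a minimal sketch, assuming a POSIX shell; the helper name lib_version is hypothetical, and the nvidia-smi query in the comments is only one way to read the driver version on the host.

```shell
#!/bin/sh
# Hypothetical helper: print the version suffix of a versioned shared library,
# i.e. everything after the last ".so." in the file name.
lib_version() {
    printf '%s\n' "${1##*.so.}"
}

# Possible use on a host with the NVIDIA driver installed (not run here):
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   [ "$(lib_version libnvidia-ptxjitcompiler.so.450.66)" = "$driver" ] && echo "matches driver"
```

If the suffix equals the host driver version, the library came from the host driver rather than from the image.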

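So it seems to me that this file exists only when you start the container with the NVIDIA runtime. I googled it and found confirmation in the documentation, which states that you need to run the container with several environment variables set so that the driver libraries are installed. So I ran the env command in the official NVIDIA container and copied every variable with the NVIDIA_ prefix into the Dockerfile: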

FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop

COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda

ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
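# Quote the value: unquoted, ENV would treat each space-separated key=value pair as a separate variable.
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411"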
ENV NVIDIA_VISIBLE_DEVICES=all
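
Copying the variables by hand can be replaced with a small filter. This is a minimal sketch, assuming a POSIX shell with grep and sed; the function name nvidia_env_to_dockerfile is hypothetical. It reads env output on stdin and prints one ENV line per NVIDIA_ variable. Each value is double-quoted so that a value containing spaces, such as NVIDIA_REQUIRE_CUDA, stays a single variable. It does not handle values containing $ or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: turn `env` output on stdin into Dockerfile ENV lines,
# keeping only variables whose names start with NVIDIA_.
nvidia_env_to_dockerfile() {
    grep '^NVIDIA_' |
        sed -e 's/"/\\"/g' \
            -e 's/^\([^=]*\)=\(.*\)$/ENV \1="\2"/'
}

# Possible use (needs Docker, not run here):
#   docker run --rm --entrypoint="" nvidia/cuda:10.0-devel-ubuntu18.04 env \
#       | nvidia_env_to_dockerfile
```

The printed lines can be pasted into the final stage of the Dockerfile.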

Running the new image with the NVIDIA runtime, I found that the file was installed:

docker run --runtime nvidia --rm -it afae756457a9

root@7ebdef701231:/# stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
  File: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
  Size: 9947144         Blocks: 19432      IO Block: 4096   regular file
Device: 801h/2049d      Inode: 131438      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-30 14:48:05.765015216 +0000
Modify: 2020-09-28 10:57:18.067125173 +0000
Change: 2020-09-28 10:57:18.067125173 +0000
 Birth: -
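
Because the library only appears when the container is started with the NVIDIA runtime, a startup check can make a missing runtime obvious. This is a minimal sketch, assuming a POSIX shell; the function name check_driver_libs is hypothetical, and it looks for the libnvidia-ptxjitcompiler.so.* files discussed above in the same directory.

```shell
#!/bin/sh
# Hypothetical startup check: succeed if the driver library from this post
# is present in the given directory (default: the directory used above).
check_driver_libs() {
    lib_dir=${1:-/usr/lib/x86_64-linux-gnu}
    for lib in "$lib_dir"/libnvidia-ptxjitcompiler.so.*; do
        # An unmatched glob stays literal, so confirm the path really exists.
        if [ -e "$lib" ]; then
            echo "found $lib"
            return 0
        fi
    done
    echo "no libnvidia-ptxjitcompiler.so.* in $lib_dir; was the container started with the NVIDIA runtime?" >&2
    return 1
}
```

Calling check_driver_libs at the top of an entrypoint script would fail fast when the container is started without the NVIDIA runtime, and it would always fail during docker build.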