Multistage docker build: stat reports that an NVIDIA file does not exist while it does
I am trying to merge two Docker images.
Here is my Dockerfile:
FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop
COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda
COPY --from=cuda10 \
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 \
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 \
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libcuda.so.410.129 \
/usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/
The build fails:
$ docker build . -t nvidia-ros:osrf
Step 5/7 : COPY --from=cuda10 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 /usr/lib/x86_64-linux-gnu/
COPY failed: stat usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03: file does not exist
But the files do exist:
$ docker run -it --rm --gpus all nvidia/cuda:10.0-devel-ubuntu18.04
root@fc9c1d8ccdc2:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root 37 Jan 30 14:13 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.460.32.03
-rw-r--r-- 1 root root 12129448 Aug 20 2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rw-r--r-- 1 root root 10516984 Dec 27 18:55 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03
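TL;DR: This file is installed at runtime (docs), so it is not present at build time. You need several environment variables set, either in the image or at container start, so that the NVIDIA runtime installs the driver libraries into the container. See the Dockerfile at the end for an example.
To investigate, I first ran this command: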
docker run --rm --entrypoint="" -it nvidia/cuda:10.0-devel-ubuntu18.04 \
stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03
and got the same error:
stat: cannot stat '/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03': No such file or directory
So I went into the directory and looked around with ls:
root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libnvidia-ptxjitcompiler.so
ls: cannot access 'libnvidia-ptxjitcompiler.so': No such file or directory
root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libn
libnccl.so libnccl_static.a libnpth.so.0 libnsl.so libnss_files.so libnss_nisplus.so
libnccl.so.2 libnettle.so.6 libnpth.so.0.1.1 libnss_compat.so libnss_hesiod.so
libnccl.so.2.6.4 libnettle.so.6.4 libnsl.a libnss_dns.so libnss_nis.so
The files are missing.
Then I used the command you shared, this time with the NVIDIA runtime:
docker run -it --rm --runtime nvidia nvidia/cuda:10.0-devel-ubuntu18.04
root@4a1602f3d5c0:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root 34 Jan 30 14:48 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.450.66
-rw-r--r-- 1 root root 12129448 Aug 20 2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rwxr-xr-x 1 root root 9947144 Sep 28 10:57 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
The file is there, but with a different version, one that matches my NVIDIA driver version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
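The version match can also be checked mechanically rather than by eye. The following is only a sketch: driver_lib_version is a hypothetical helper, not part of any NVIDIA tool, and it assumes the library file name ends in .so.<driver version>, as in the listing above.

```shell
# Hypothetical helper: print the version suffix of a driver library file name,
# i.e. everything after the first ".so." in the base name.
driver_lib_version() {
  name=${1##*/}                    # drop the directory part
  printf '%s\n' "${name#*.so.}"    # drop everything up to and including ".so."
}

driver_lib_version /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
```

Inside a container started with the NVIDIA runtime, the result could then be compared with the output of nvidia-smi --query-gpu=driver_version --format=csv,noheader, assuming nvidia-smi is available in that container.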
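So it seems to me that this file only exists when you start the container with the NVIDIA runtime. I googled and found confirmation here. The documentation states that you need to run the container with several environment variables set in order for the driver libraries to be installed. So I ran the env command in the official NVIDIA container and copied every variable with the NVIDIA_ prefix into the Dockerfile: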
FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop
COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411"
ENV NVIDIA_VISIBLE_DEVICES=all
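Copying the NVIDIA_ variables by hand is easy to get wrong, especially for values that contain spaces. As a rough sketch, the step could be scripted; nvidia_env_to_dockerfile is a hypothetical helper name, and the function assumes that no value contains a double quote or a newline.

```shell
# Hypothetical helper: read `env`-style KEY=VALUE lines on stdin and print a
# Dockerfile ENV line for every variable whose name starts with NVIDIA_.
# Values are double-quoted so that values containing spaces stay one variable.
# Assumption: no value contains a double quote or a newline.
nvidia_env_to_dockerfile() {
  grep '^NVIDIA_' | while IFS= read -r line; do
    key=${line%%=*}      # text before the first "="
    value=${line#*=}     # text after the first "="
    printf 'ENV %s="%s"\n' "$key" "$value"
  done
}
```

It would be fed from the official image, for example: docker run --rm --entrypoint="" nvidia/cuda:10.0-devel-ubuntu18.04 env | nvidia_env_to_dockerfile. The printed lines can then be pasted into the final stage of the Dockerfile.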
Running the new image with the NVIDIA runtime, I found that the file was installed:
docker run --runtime nvidia --rm -it afae756457a9
root@7ebdef701231:/# stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
File: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
Size: 9947144 Blocks: 19432 IO Block: 4096 regular file
Device: 801h/2049d Inode: 131438 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-01-30 14:48:05.765015216 +0000
Modify: 2020-09-28 10:57:18.067125173 +0000
Change: 2020-09-28 10:57:18.067125173 +0000
Birth: -