"docker:19.03-dind" could not select device driver "nvidia" with capabilities: [[gpu]]
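I ran into a problem with K8S + DinD (Docker-in-Docker):
- Start a Kubernetes cluster
- In this cluster, start a main docker image and a DinD image
- When running a job that requests a GPU, I get the error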
could not select device driver "nvidia" with capabilities: [[gpu]]
Full error:
http://localhost:2375/v1.40/containers/long-hash-string/start: Internal Server Error ("could not select device driver "nvidia" with capabilities: [[gpu]]")
When I exec into the DinD container inside the K8S pod, nvidia-smi is not available.
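After some debugging, it seems the cause is that DinD is missing the NVIDIA Docker toolkit. When I ran the same job directly on my local laptop's docker, I hit the same error, and I fixed it there by installing nvidia-docker2:
sudo apt-get install -y nvidia-docker2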
I was thinking maybe I could try installing nvidia-docker2 into DinD 19.03 (docker:19.03-dind), but I'm not sure how to do it. With a multi-stage docker build?
Thanks a lot!
Update:
Pod spec:
spec:
  containers:
  - name: dind-daemon
    image: docker:19.03-dind
I got it working myself.
Reference:
First, I modified the ubuntu-dind image (https://github.com/billyteves/ubuntu-dind) to install nvidia-docker (i.e. added the instructions in the nvidia-docker site to the Dockerfile) and changed it to be based on nvidia/cuda:9.2-runtime-ubuntu16.04.
Then I created a pod with two containers, a frontend ubuntu container and a privileged docker daemon container as a sidecar. The sidecar's image is the modified one I mentioned above.
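But since that post is three years old, I spent quite a lot of time matching dependency versions, working around three years of repository migrations, and so on.
My modified version of the Dockerfile used to build it: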
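ARG CUDA_IMAGE=nvidia/cuda:11.0.3-runtime-ubuntu20.04
FROM ${CUDA_IMAGE}
# Note: DOCKER_CE_VERSION is not referenced below; the install step takes the
# current docker-ce package for the base image's Ubuntu release.
ARG DOCKER_CE_VERSION=5:18.09.1~3-0~ubuntu-xenial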
RUN apt-get update -q && \
    apt-get install -yq \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg-agent \
        software-properties-common && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
    add-apt-repository \
        "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) \
        stable" && \
    apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io
# https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
RUN set -eux; \
    apt-get update -q && \
    apt-get install -yq \
        btrfs-progs \
        e2fsprogs \
        iptables \
        xfsprogs \
        xz-utils \
# pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
        pigz \
#       zfs \
        wget
# set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
RUN set -x \
    && addgroup --system dockremap \
    && adduser --system --ingroup dockremap dockremap \
    && echo 'dockremap:165536:65536' >> /etc/subuid \
    && echo 'dockremap:165536:65536' >> /etc/subgid
# https://github.com/docker/docker/tree/master/hack/dind
ENV DIND_COMMIT=37498f009d8bf25fbb6199e8ccd34bed84f2874b
RUN set -eux; \
    wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
    chmod +x /usr/local/bin/dind
##### Install nvidia docker #####
# Add the package repositories
RUN curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add --no-tty -
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
    echo $distribution && \
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    tee /etc/apt/sources.list.d/nvidia-docker.list
RUN apt-get update -qq --fix-missing
RUN apt-get install -yq nvidia-docker2
# Make "nvidia" the default runtime by inserting a key at line 2 of daemon.json
RUN sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json
RUN mkdir -p /usr/local/bin/
COPY dockerd-entrypoint.sh /usr/local/bin/
# 755 is enough to make the entrypoint executable; it does not need to be world-writable
RUN chmod 755 /usr/local/bin/dockerd-entrypoint.sh
RUN ln -s /usr/local/bin/dockerd-entrypoint.sh /
VOLUME /var/lib/docker
EXPOSE 2375
ENTRYPOINT ["dockerd-entrypoint.sh"]
#ENTRYPOINT ["/bin/sh", "/shared/dockerd-entrypoint.sh"]
CMD []
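To make the sed step in the Dockerfile less opaque, here is a small sketch that applies the same expression to a scratch copy of daemon.json. The starting file contents are an assumption about what the nvidia-docker2 package installs (check the real /etc/docker/daemon.json in your image), and the one-line `i` form assumes GNU sed, which is what the Ubuntu base image provides.

```shell
# Work on a scratch copy so nothing under /etc is touched.
workdir="$(mktemp -d)"
daemon_json="$workdir/daemon.json"

# ASSUMPTION: this mirrors the daemon.json shipped by nvidia-docker2;
# the real file in your image may differ.
cat > "$daemon_json" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Same expression as in the Dockerfile: insert a new line before line 2,
# so the key lands right after the opening brace.
sed -i '2i \ \ \ \ "default-runtime": "nvidia",' "$daemon_json"

# Show the result: "default-runtime" now precedes the "runtimes" block.
cat "$daemon_json"
```

With "default-runtime" set to "nvidia", containers started by the inner daemon should use the NVIDIA runtime even when the client does not ask for it explicitly. Note that the insert depends on the file starting with a lone opening brace on line 1, so it is fragile if the package ever changes the file layout.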
Now when I exec into the Docker-in-Docker container, I can run nvidia-smi successfully (before, it returned a "not found" error, and no GPU-related docker run would work).
You are welcome to pull my image: brandsight/dind:nvidia-docker
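For completeness, the two-container pod described in the reference (a frontend container plus a privileged Docker daemon sidecar) might look roughly like the manifest below. This is only a sketch: the pod name, the frontend image and command, the emptyDir volume, and the `nvidia.com/gpu` limit (which assumes the NVIDIA device plugin is installed on the cluster) are all assumptions that are not part of my actual setup as posted. The `DOCKER_HOST` value matches the localhost:2375 endpoint seen in the error message.

```shell
# Write a hypothetical pod manifest to a temporary file (nothing is applied).
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dind-gpu-example            # hypothetical name
spec:
  containers:
  - name: frontend
    image: ubuntu:20.04             # assumption: swap in an image that has the docker CLI
    command: ["sleep", "infinity"]
    env:
    - name: DOCKER_HOST             # point the client at the sidecar daemon
      value: tcp://localhost:2375
  - name: dind-daemon
    image: brandsight/dind:nvidia-docker
    securityContext:
      privileged: true              # the inner dockerd needs a privileged container
    resources:
      limits:
        nvidia.com/gpu: 1           # assumption: NVIDIA device plugin is installed
    volumeMounts:
    - name: docker-graph-storage
      mountPath: /var/lib/docker
  volumes:
  - name: docker-graph-storage
    emptyDir: {}
EOF
echo "manifest written to $manifest"
# To try it on a cluster: kubectl apply -f "$manifest"
```

Both containers share the pod's network namespace, which is why the frontend can reach the daemon on localhost.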