Docker service with GPU configured in compose file; no GPU recognized by Keras
I have configured a multi-service application in a v3.5 docker compose file.
One of the services should have access to the (single) GPU on (one of) the nodes in the cluster.
However, if I start the services via the docker compose file, I don't seem to get access to the GPU, as reported by keras:
import keras
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
prints
Using TensorFlow backend.
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality { }
incarnation: 10790773049987428954,
name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality { }
incarnation: 239154712796449863
physical_device_desc: "device: XLA_CPU device"]
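To make this check scriptable, a small helper (my own addition, not part of the question, and deliberately independent of TensorFlow) can scan the printed device list for a GPU entry:

```python
def has_gpu(device_list_str):
    """Return True if a printed device list mentions a GPU device.

    Operates on the string printed by device_lib.list_local_devices(),
    so it can be used in log-checking scripts without importing TensorFlow.
    """
    return ('device_type: "GPU"' in device_list_str
            or 'device_type: "XLA_GPU"' in device_list_str)

# The compose-started container above only shows CPU/XLA_CPU devices:
assert not has_gpu('device_type: "CPU"\ndevice_type: "XLA_CPU"')
# The manual docker run below additionally shows an XLA_GPU device:
assert has_gpu('device_type: "XLA_GPU"')
```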
If I run the same image from the command line, like this:
docker run -it --rm $(ls /dev/nvidia* | xargs -I{} echo '--device={}') $(ls /usr/lib/*-linux-gnu/{libcuda,libnvidia}* | xargs -I{} echo '-v {}:{}:ro') -v $(pwd):/srv --entrypoint /bin/bash ${MY_IMG}
the output is
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3178082198631681841,
name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 15685155444461741733
physical_device_desc: "device: XLA_CPU device",
name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 4056441191345727860
physical_device_desc: "device: XLA_GPU device"]
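As an aside, with nvidia-docker2 (or the newer nvidia-container-toolkit) installed, the manual `--device` and library bind-mounts above are usually unnecessary. A sketch of the simpler equivalents (not taken from the question; which one applies depends on the Docker version):

```
# Docker 19.03+ with nvidia-container-toolkit:
docker run -it --rm --gpus all -v $(pwd):/srv --entrypoint /bin/bash ${MY_IMG}

# Older nvidia-docker2 setups (nvidia runtime registered with dockerd):
docker run -it --rm --runtime=nvidia -v $(pwd):/srv --entrypoint /bin/bash ${MY_IMG}
```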
Configuration:
I have installed nvidia-docker and configured the node according to this guide:
/etc/systemd/system/docker.service.d/override.conf:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-runtime=nvidia --node-generic-resource gpu=GPU-b7ad85d5
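The identifier passed to `--node-generic-resource` (shown truncated here as `GPU-b7ad85d5`) is the GPU's device UUID. For reference, it can be listed on the node with nvidia-smi, e.g.:

```
nvidia-smi --query-gpu=uuid,name --format=csv
```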
and
/etc/nvidia-container-runtime/config.toml:
disable-require = false
swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
The relevant part of the docker compose file:
docker-compose.yaml:
version: '3.5'
...
services:
...
my-service:
...
deploy:
resources:
reservations:
generic_resources:
- discrete_resource_spec:
kind: 'gpu'
value: 1
Question:
What else is required to get access to the GPU in that docker service?
NVIDIA-Docker only works with the Docker Compose 2.3 file format.
Change the version to version: '2.3'.
See https://github.com/NVIDIA/nvidia-docker/wiki#do-you-support-docker-compose.
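With the 2.3 file format, the nvidia runtime can be requested per service via the `runtime` key. A minimal sketch (the service and image names are placeholders, and `runtime: nvidia` assumes the nvidia runtime is registered with dockerd on the host):

```
version: '2.3'
services:
  my-service:
    image: my-image          # placeholder
    runtime: nvidia          # requires nvidia-docker2 on the host
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```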