Cannot use GPU on Minikube with Docker driver

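Goal:

I am trying to use Nvidia GPU capabilities on a Minikube cluster that uses the default Docker driver.

Problem:

I can use nvidia-docker in the default docker context, but when I switch to the minikube docker-env I get the following error: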

$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

Environment:

$ docker version
Client: Docker Engine - Community
 Version:           19.03.10
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        9424aeaee9
 Built:             Thu May 28 22:16:49 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.9
  Git commit:       6a30dfca03
  Built:            Wed Sep 11 22:45:55 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.3-14-g449e9269
  GitCommit:        449e926990f8539fd00844b26c07e2f1e306c760
 runc:
  Version:          1.0.0-rc10
  GitCommit:        
 docker-init:
  Version:          0.18.0
  GitCommit:

$ nvidia-container-runtime --version
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

Additional information:

The cluster was created with:

minikube start --cpus 3 --memory 8G

The following minikube addons are currently enabled:

$ minikube addons list
|-----------------------------|----------|--------------|
|         ADDON NAME          | PROFILE  |    STATUS    |
|-----------------------------|----------|--------------|
| dashboard                   | minikube | disabled     |
| default-storageclass        | minikube | enabled ✅    |
| efk                         | minikube | disabled     |
| freshpod                    | minikube | disabled     |
| gvisor                      | minikube | disabled     |
| helm-tiller                 | minikube | disabled     |
| ingress                     | minikube | disabled     |
| ingress-dns                 | minikube | disabled     |
| istio                       | minikube | disabled     |
| istio-provisioner           | minikube | disabled     |
| logviewer                   | minikube | disabled     |
| metallb                     | minikube | disabled     |
| metrics-server              | minikube | disabled     |
| nvidia-driver-installer     | minikube | enabled ✅    |
| nvidia-gpu-device-plugin    | minikube | enabled ✅    |
| registry                    | minikube | disabled     |
| registry-aliases            | minikube | disabled     |
| registry-creds              | minikube | disabled     |
| storage-provisioner         | minikube | enabled ✅    |
| storage-provisioner-gluster | minikube | disabled     |
|-----------------------------|----------|--------------|

Here is a working example outside the minikube context:

$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Fri Jun  5 09:23:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   51C    P8     6W / 120W |   1293MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

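This is a community wiki answer. Feel free to edit and expand it if needed.

Nvidia GPUs are not officially supported with Minikube's docker driver. That leaves you with two possible options: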

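  1. Try the NVIDIA Container Toolkit together with the NVIDIA device plugin. This is a workaround and may not be the best solution for your use case.

  2. Use the KVM2 driver or the None driver. Both are officially supported and documented.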
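
For the second option, here is a rough sketch of what the KVM2 route could look like. It assumes the host is already prepared for GPU passthrough (which also means the GPU is handed to the VM rather than used by the host) and that your minikube version still provides the `--kvm-gpu` flag and the two NVIDIA addons shown in the addon list above. The helper names `run` and `start_gpu_minikube` are made up for illustration, so treat this as a starting point rather than a verified recipe:

```shell
#!/usr/bin/env bash
# Sketch only: recreate the cluster with the kvm2 driver and GPU passthrough.
# `run` and `start_gpu_minikube` are hypothetical helper names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

start_gpu_minikube() {
  # Assumption: the host already meets minikube's GPU passthrough prerequisites.
  run minikube start --driver=kvm2 --kvm-gpu || return 1
  # Same addons that are enabled in the question's addon list.
  run minikube addons enable nvidia-gpu-device-plugin || return 1
  run minikube addons enable nvidia-driver-installer || return 1
}
```

Source the script and call `DRY_RUN=1 start_gpu_minikube` first to see the commands without running them. An existing cluster created with the docker driver would most likely have to be deleted before switching drivers.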

Hope this helps.