Unable to get a Debian VM to work with a K80

Question

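I created a Deep Learning VM to run a project that uses some custom TensorFlow models together with the Google Vision API and the Google NLU API. I set up a machine with Debian 10 and TensorFlow 2.4 (CUDA 11), and I selected 1 NVIDIA K80 GPU. I installed CUDA 11 using this link. When I run nvidia-smi, I get this famously ugly message: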

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I tried to install CUDA 10 or any other version, but none exists for Debian at all: see this CUDA 10 link.

How can I fix this?

Answer 1

I tried to reproduce this error in my own project. I created a VM instance with the following characteristics:

Machine type: n1-standard-1
GPU: 1 x NVIDIA Tesla K80
Boot disk: debian-10-buster-v20201216

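As you mentioned in your post, Linux: CUDA Toolkit 10 offers no driver for Debian, so I installed it following the steps described in this link. I ran into some complications while installing the driver, but in the end I was able to reproduce your issue. After the installation I got the following message: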

$ sudo nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
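
If you want to detect this state from a script rather than by eye, here is a minimal sketch. It assumes that matching on the error text shown above is good enough. The names classify_nvidia_smi and check_driver are hypothetical helpers made up for this illustration, not part of any NVIDIA or Google tooling.

```python
import shutil
import subprocess

# Substring of the error message shown above.
FAILURE_MARKER = "couldn't communicate with the NVIDIA driver"


def classify_nvidia_smi(returncode, output):
    """Classify an nvidia-smi result as 'ok', 'driver-unreachable', or 'error'."""
    if FAILURE_MARKER in output:
        return "driver-unreachable"
    if returncode == 0:
        return "ok"
    return "error"


def check_driver():
    """Run nvidia-smi if it is on PATH and classify what it reports."""
    if shutil.which("nvidia-smi") is None:
        return "not-installed"
    proc = subprocess.run(
        ["nvidia-smi"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # the error text may go to either stream
        universal_newlines=True,
    )
    return classify_nvidia_smi(proc.returncode, proc.stdout)


if __name__ == "__main__":
    print(check_driver())
```

A "driver-unreachable" result means the nvidia-smi binary is present but the kernel driver is not loaded or not usable, which is the situation on the plain Debian 10 image.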

I tried again, this time with a slightly different setup:

Machine type: n1-standard-1
GPU: 1 x NVIDIA Tesla K80
Boot disk: c0-common-gce-gpu-image-20200128

The boot disk I used this time, c0-common-gce-gpu-image-20200128, is a GPU-optimized Debian image, m32 (with CUDA 10.0): a Debian 9 based image with CUDA/CuDNN/NCCL pre-installed.

When I first connected to this instance over SSH, I got the following prompt:

This VM requires Nvidia drivers to function correctly.   Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n] y
Installing Nvidia driver.

It installs the driver automatically.

$ sudo nvidia-smi
Thu Jan  7 19:08:06 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    91W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I also tried a TensorFlow image, since you mentioned that you are using TensorFlow: c0-deeplearning-tf-1-15-cu110-v20201229-debian-10. According to the image description, it is a Deep Learning Image: TensorFlow 1.15, m61 CUDA 110, a debian-10 Linux based image with TensorFlow 1.15 (with CUDA 110 and Intel(TM) MKL-DNN, Intel® MKL) plus Intel(TM) optimized NumPy, SciPy, and scikit-learn. In this case I verified the TensorFlow installation too:

$ python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2021-01-07 20:29:02.854218: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
Tensor("Sum:0", shape=(), dtype=float32)

And it worked fine.
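
The log line above only shows that libcudart was loaded. To confirm that TensorFlow actually sees the K80, something like the following sketch could be used. It assumes tf.config.experimental.list_physical_devices is available in the installed TensorFlow (it is present in 1.15 and in 2.x as far as I know). The helper name visible_gpus is hypothetical.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Assumption: the experimental namespace exists in the installed version.
    devices = tf.config.experimental.list_physical_devices("GPU")
    return [device.name for device in devices]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed")
    elif not gpus:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("Visible GPUs:", gpus)
```

An empty list on a VM with a GPU attached usually points back to the driver problem that nvidia-smi reports.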

So there seems to be a problem between the installed image (Debian 10) and the CUDA toolkit required by the GPU type (NVIDIA K80).

My suggestion is to use a Deep Learning VM image. You can see the full list at this link: Choosing an image

nvidia

google-cloud-platform