Could not load dynamic library libcuda.so.1 error on Google AI Platform with custom container

Question

I am trying to launch a training job on Google AI Platform with a custom container. Since I want to use GPUs for training, the base image I use for my container is:

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

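With this image (and tensorflow 2.4.1 installed on top of it), I thought I could use GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following: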

W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.

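Is this a good way to build an image for using GPUs on Google AI Platform? Or should I instead start from a tensorflow image and manually install all the drivers needed to use the GPUs?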

Edit: I read the following here (https://cloud.google.com/ai-platform/training/docs/containers-overview):

For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.

Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the nvidia/cuda image as your base image is the recommended way to handle this. It has the matching versions of CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.

Install your training application, along with your required ML framework and other dependencies in your Docker image.

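They also provide an example Dockerfile for training with GPUs. So what I did seems fine. Unfortunately, I still get the errors mentioned above, which may (or may not) explain why I cannot use GPUs on Google AI Platform.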

EDIT2: Following what I read here (https://www.tensorflow.org/install/gpu), my Dockerfile is now:

FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
 lsb-release \
 vim \
 curl \
 git \
 libgl1-mesa-dev \
 software-properties-common \
 wget && \
 rm -rf /var/lib/apt/lists/*

# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update

RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

RUN apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update

# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi

RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update

# Install development and runtime libraries (~4GB)
RUN apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0


# other stuff

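The problem is that the build freezes at what looks like a keyboard configuration step. The system asks me to select a country, and when I enter the number, nothing happens.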

Answer 1

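The recommended way to build the most reliable container is to use the officially maintained 'Deep Learning Containers'. I suggest pulling 'gcr.io/deeplearning-platform-release/tf2-gpu.2-4'. It should already have CUDA, cuDNN, the GPU drivers, and TF 2.4 installed and tested. You only need to add your code on top of it.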
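
As a rough illustration of "adding your code on top", a minimal Dockerfile sketch might look like the one below. Only the base image tag comes from the suggestion above. The `trainer/` directory, the `/app` working directory, and the `trainer.task` entry point are hypothetical placeholders, so adjust them to match your own project layout.

```dockerfile
# Base image suggested above: a Deep Learning Container with TF 2.4 for GPU.
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

# Hypothetical layout: the training package lives in ./trainer next to this Dockerfile.
WORKDIR /app
COPY trainer/ /app/trainer/

# Hypothetical entry point: runs trainer/task.py as a module when the job starts.
ENTRYPOINT ["python", "-m", "trainer.task"]
```

Because the base image already carries the CUDA and cuDNN stack, this sketch has no `apt-get install` steps for NVIDIA packages, which also sidesteps the interactive prompt that stalled the build in the question.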

Tags: cuda, nvidia, docker, tensorflow, google-ai-platform