Cannot create an Endpoint with Unified Cloud AI Platform custom containers

Question

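Due to some VPC restrictions, I have to use a custom container to serve predictions for a model trained with TensorFlow. As the documentation requires, I created an HTTP server using TensorFlow Serving. The Dockerfile used to build the image is as follows: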

FROM tensorflow/serving:2.4.1-gpu

# copy the model file
ENV MODEL_NAME=my_model
COPY my_model /models/my_model

where my_model contains the saved_model inside a folder named 1/.

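I then pushed this image to Google's container registry and created a Model using "Import an existing custom container", changing the Port to 8501. However, when I try to deploy the model to an endpoint using a single compute node of type n1-standard-16 with 1 P100 GPU, the deployment fails with the following error: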

Failed to create session: Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

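I'm not sure what is going on here. I can run the same Docker image on my local machine, and I can successfully get predictions by hitting the endpoint it creates: http://localhost:8501/v1/models/my_model:predict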
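
For reference, the local check against that URL can be sketched in Python roughly as below. This is only an illustration, not the exact client I used: the helper names build_predict_request and predict are made up for this example, and the shape of instances depends entirely on the model's serving signature. It assumes the standard TensorFlow Serving REST format, where the request body is a JSON object with an "instances" list and the response carries a "predictions" list.

```python
import json
import urllib.request


def build_predict_request(instances, model_name="my_model",
                          host="localhost", port=8501):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call.

    `instances` is a list of inputs in row format; its shape is an
    assumption here and must match the model's serving signature.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body


def predict(instances, **kwargs):
    """POST the instances to the serving container and return its predictions."""
    url, body = build_predict_request(instances, **kwargs)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["predictions"]


# Example call (requires the container to be running locally):
# predictions = predict([[1.0, 2.0, 3.0]])
```

The request construction is kept separate from the network call so the URL and payload can be inspected without a running server.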

Any help in this regard would be appreciated.

Answer 1

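The issue was resolved by downgrading the TensorFlow Serving image to the 2.3.0-gpu version. Judging by the error message, the CUDA runtime in the custom model image requires a newer CUDA driver than the one available on the GCP AI Platform cluster nodes, so the two versions do not match.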

google-cloud-platform

tensorflow-serving

google-cloud-ml

google-ai-platform