TPU returning "failed call to cuInit: UNKNOWN ERROR (303)" on Google Cloud with Kubernetes Cluster

Question

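I am trying to use a TPU with Google Cloud's Kubernetes Engine. When I try to initialize the TPU, my code returns several errors, and every other operation runs only on the CPU. To run the program, I pull a Python file from my Dockerhub workspace into Kubernetes and then execute it on a single preemptible v2 TPU. The TPU uses TensorFlow 2.3, which as far as I know is the latest version supported for Cloud TPU. (When I try TensorFlow 2.4 or 2.5, I get an error saying that the version is not yet supported.)

When I run my code, Google Cloud sees the TPU but fails to connect to it and uses the CPU instead. It returns this error: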

tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)

tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (resnet-tpu-fxgz7): /proc/driver/nvidia/version does not exist

tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2299995000 Hz

tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561fb2112c20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:

tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001

TPU name grpc://10.8.16.2:8470

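The errors seem to suggest that TensorFlow needs the NVIDIA packages installed, but my understanding from the Google Cloud TPU documentation is that I should not need tensorflow-gpu for a TPU. I tried tensorflow-gpu anyway and got the same error, so I am not sure how to fix this. I have deleted and recreated my cluster and TPU several times without making any progress. I am new to Google Cloud, so I may be missing something obvious, but any help would be greatly appreciated.

This is the Python script I am trying to run: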

import tensorflow as tf
import os

import sys


# Parse the TPU name argument 
tpu_name = sys.argv[1]
tpu_name = tpu_name.replace('--tpu=', '')
print("TPU name", tpu_name)


tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection

tpu_name = 'grpc://' + str(tpu.cluster_spec().as_dict()['worker'][0])

print("TPU name", tpu_name)
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

This is the yaml configuration file for my Kubernetes cluster (for this post I have replaced my real workspace name and image with placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    metadata:
      name: test 
      annotations:
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: regcred
      containers:
        - name:  test
          image: my_workspace/image 
          command: ["/bin/bash","-c","pip3 install cloud-tpu-client tensorflow==2.3.0 && python3 DebugTPU.py --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]

          resources:
            limits:
              cloud-tpus.google.com/preemptible-v2: 8
  backoffLimit: 0

Answer 1

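There is actually no error in the workload or the logs you have provided. A few comments that I think might help:

pip install tensorflow, as you have noted, installs the GPU-enabled build (the same thing as tensorflow-gpu). By default it tries to run the GPU-specific initialization and fails (failed call to cuInit: UNKNOWN ERROR (303)), so it falls back to local CPU execution. This would be an error if you were trying to develop on a GPU VM, but in a typical CPU workload it does not matter. Essentially tensorflow == tensorflow-gpu, and without a GPU available it is equivalent to tensorflow-cpu with additional error messages. Installing tensorflow-cpu would make these warnings go away.
In this workload, the TPU server has its own installation of TensorFlow running as well. It does not actually matter whether your local VM (e.g. your GKE container) has tensorflow-gpu or tensorflow-cpu, as long as it is the same TF version as the TPU server. Your workload here is successfully connecting to the TPU server, as indicated by: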

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001
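
If you want to confirm the connection from inside the container rather than by reading the log, one option is to list the TPU devices after initialization. The following is only a minimal sketch under a few assumptions: the helper names parse_tpu_address and list_tpu_devices are made up for illustration, the script is started with --tpu=<grpc address> as in your yaml, and the container's TensorFlow version matches the TPU server's.

```python
def parse_tpu_address(argv):
    """Return the address passed as --tpu=<address>, or None if it is absent.

    Hypothetical helper: it mirrors the string handling in DebugTPU.py but
    does not assume the flag is the first argument.
    """
    for arg in argv[1:]:
        if arg.startswith("--tpu="):
            return arg[len("--tpu="):]
    return None


def list_tpu_devices(tpu_address):
    """Connect to the TPU server and return the logical TPU devices it exposes.

    Hypothetical helper built from the same three calls the question's
    script already makes, plus tf.config.list_logical_devices.
    """
    # Imported here so the argument parsing above can be used without TensorFlow.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # After initialization, the remote TPU cores show up as logical devices.
    return tf.config.list_logical_devices("TPU")
```

In DebugTPU.py this could be used as devices = list_tpu_devices(parse_tpu_address(sys.argv)), followed by printing each device.name. A non-empty list would agree with the log lines above; an empty list, or an exception during initialization, would point to a real connection problem rather than the harmless CUDA messages.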
