A100 tensorflow gpu error: "Failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error"

Question

I am trying to run TensorFlow with GPU support in Docker on a virtual machine. I have tried many solutions found online, including:

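Tried several versions of the TensorFlow Docker image: 2.6, 2.4, 1.15, 1.14
Built TensorFlow (2.6 and 1.14) from source inside the container several times with different bazel flags, following this guide: https://www.tensorflow.org/install/source
Tried to make the GPU visible with these commands:
Used the NVIDIA TensorFlow Docker image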

None of these solutions worked for me. Here are some of the steps I took:

Using nvidia-smi and nvcc -V, I verified that the driver, CUDA, and the cuDNN toolkit are installed inside the container:

The Python version is: Python 3.8.10

The TensorFlow version is:

import tensorflow as tf 
tf.__version__
'2.6.0'

The following call produces the error: tf.config.list_physical_devices()

So the GPU is somehow not visible to TensorFlow. Every TensorFlow build returns the same error:

 E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error

For 1.14, however, there is an additional note about the CPU type:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

The GPU is an A100 and the CPU is an Intel(R) Xeon(R) Gold 6226R.

What is going on here? How can I fix this?

Answer 1

I found out that the GPU has a Multi-Instance GPU (MIG) feature:

So a GPU instance has to be configured:

sudo nvidia-smi mig -cgi 0 -C

After that, calling nvidia-smi gives you:

Problem solved!
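If you want to check the MIG state from Python rather than by reading the nvidia-smi table, something along these lines might help. It is only a sketch: it assumes nvidia-smi is on the PATH inside the container and that the driver supports the mig.mode.current query field, and the function names (parse_mig_modes, query_mig_modes, tensorflow_gpus) are made up for this example.

import shutil
import subprocess


def parse_mig_modes(csv_text):
    """Turn 'index, mode' CSV lines into a list of (gpu_index, mig_mode) pairs."""
    pairs = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = [field.strip() for field in line.split(",", 1)]
        pairs.append((int(index), mode))
    return pairs


def query_mig_modes():
    """Ask nvidia-smi for each GPU's current MIG mode.

    Returns None if nvidia-smi is not found or the query fails.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_mig_modes(result.stdout)


def tensorflow_gpus():
    """List the GPUs TensorFlow can see; returns None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    print("MIG modes:", query_mig_modes())
    print("TensorFlow GPUs:", tensorflow_gpus())

A GPU that reports Enabled here while TensorFlow lists no GPUs is consistent with the situation in the question, where MIG mode was on but no GPU instance had been created yet. After running the nvidia-smi mig command above, the second line should list a GPU device.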

python