tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I am trying to use a GPU with TensorFlow. My TensorFlow version is 2.4.1 and I am using CUDA version 11.2. Here is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX110       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    N/A /  N/A |    254MiB /  2004MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1151      G   /usr/lib/xorg/Xorg                 37MiB |
|    0   N/A  N/A      1654      G   /usr/lib/xorg/Xorg                136MiB |
|    0   N/A  N/A      1830      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      5443      G   /usr/lib/firefox/firefox            0MiB |
|    0   N/A  N/A      5659      G   /usr/lib/firefox/firefox            0MiB |
+-----------------------------------------------------------------------------+
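I ran into a strange problem. Earlier, when I listed all physical devices with tf.config.list_physical_devices(), it identified one CPU and one GPU. I then tried a simple matrix multiplication on the GPU. It failed with this error: failed to synchronize cuda stream CUDA_LAUNCH_ERROR (the error code was something like that; I forgot to note it down). After that, when I tried the same thing again from another terminal, it did not recognize any GPU. This time, listing the physical devices produces the following: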
>>> tf.config.list_physical_devices()
2021-04-11 18:56:47.504776: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-11 18:56:47.507646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534244: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
2021-04-11 18:56:47.534404: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.39.0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
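The two useful facts in a log like the one above are the cuInit error name and whether the libcuda and kernel-module versions agree. The following is a minimal sketch, not part of the original post, that pulls both out of captured stderr text. The function name `diagnose` and the regular expressions are illustrative assumptions based only on the line formats shown here.

```python
import re

# Patterns modelled on the log lines shown in this thread (an assumption:
# other TensorFlow versions may word these messages differently).
CUINIT_RE = re.compile(r"failed call to cuInit: (CUDA_ERROR_\w+)")
VERSION_RE = re.compile(r"(libcuda|kernel) reported version is: ([\d.]+)")


def diagnose(log_text):
    """Return (cuInit error name or None, {'libcuda': ver, 'kernel': ver})."""
    err = CUINIT_RE.search(log_text)
    versions = {kind: ver for kind, ver in VERSION_RE.findall(log_text)}
    return (err.group(1) if err else None), versions


# Lines copied from the log in the question.
sample = """\
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
"""

error, versions = diagnose(sample)
print(error, versions)
```

Here the versions match and cuInit still fails, which suggests the problem is not a user-space/kernel driver mismatch.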
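My OS is Ubuntu 20.04, with Python 3.8.5, TensorFlow 2.4.1 (as mentioned above), and CUDA 11.2. I installed CUDA following these instructions. One additional piece of information: when I import tensorflow, it shows the following output: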
import tensorflow as tf
2021-04-11 18:56:07.716683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
What am I missing? Why can it no longer recognize the GPU when it did before?
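tl;dr: Disable Secure Boot before installing the Nvidia driver.
I had the exact same error, and I spent a lot of time trying to figure out whether I had installed something TensorFlow-related incorrectly. After hours of troubleshooting, I found that my NVIDIA driver was misbehaving because I had never disabled Secure Boot in the BIOS when setting up Ubuntu 20.04. Here is my suggestion (I chose to use Docker with TensorFlow, which avoids installing all the CUDA-related stuff); I hope it works for you!
- Disable Secure Boot in your BIOS
- Do a fresh install of Ubuntu 20.04
- Install Docker following nvidia-container-toolkit's page.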
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
- Install nvidia-container-toolkit from the same page.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
- Test that it works with:
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
- Finally, use TensorFlow with GPU support in Docker!
docker run --gpus all -u $(id -u):$(id -g) -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --ip=0.0.0.0
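I just created an account to say that @Nate's answer worked for me.
My setup is exactly the same as yours, and I had been trying for two days.
What I ended up doing was:
Reboot - F10 to enter setup - Security - BIOS Secure Boot (or something similar, I don't remember exactly) - Disable
There were some extra confirmation steps after that, but it worked well. I did not reinstall the whole of Ubuntu; that felt a bit too risky for me technically.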
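Before rebooting into the firmware setup, it can help to confirm whether Secure Boot is actually on. A likely explanation for the fix, not confirmed by the posters, is that with Secure Boot enabled the kernel rejects unsigned modules, so an NVIDIA module that was not signed with an enrolled key may fail to load properly. The sketch below assumes the `mokutil` tool is installed (it ships with Ubuntu by default); the helper names `parse_sb_state` and `secure_boot_enabled` are hypothetical.

```python
import subprocess


def parse_sb_state(output):
    """Map `mokutil --sb-state` output to True/False, or None if unrecognized."""
    text = output.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None  # e.g. a non-EFI system, or unexpected wording


def secure_boot_enabled():
    """Run mokutil and report the Secure Boot state; None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["mokutil", "--sb-state"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None  # mokutil is not installed on this machine
    return parse_sb_state(result.stdout + result.stderr)


print(secure_boot_enabled())
```

A result of `True` means the BIOS change described above is still needed; `None` means the state could not be read this way and should be checked in the firmware setup directly.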
Then I tried the tf.config line and got this:
2021-06-14 17:12:19.546509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-06-14 17:12:26.754680: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-14 17:12:26.909679: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3593460000 Hz
2021-06-14 17:12:26.910016: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a8352501c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-14 17:12:26.910040: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-14 17:12:26.972350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-14 17:12:27.074861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-14 17:12:27.075289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:0c:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.665GHz coreCount: 14 deviceMemorySize: 3.81GiB deviceMemoryBandwidth: 119.24GiB/s
There were more red lines after the device properties at the end, but I got
Default GPU Device: /device:GPU:0
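I don't know why this works, but it does. Just change the Secure Boot setting.
I don't have enough reputation to upvote Nate's answer. I'll come back later. But he/she did provide a great solution.
Disabling Secure Boot fixed it immediately. No need to reinstall anything.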
> import tensorflow as tf
> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
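Seeing the GPU in the device list is not the whole story: in the original question the GPU was listed and the first matrix multiplication still failed. The following is a minimal sketch of a check that also launches a kernel on the device. The function name `gpu_smoke_test` and the matrix size are arbitrary choices for illustration, and it assumes TensorFlow 2.x in eager mode.

```python
def gpu_smoke_test():
    """Return True if a matmul runs on GPU:0, False if no GPU is visible,
    or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    if not tf.config.list_physical_devices("GPU"):
        return False

    # Place a small matrix multiplication explicitly on the first GPU.
    with tf.device("/GPU:0"):
        a = tf.random.uniform((256, 256))
        b = tf.matmul(a, a)

    # Reading the result back forces the computation to complete.
    return bool(tf.reduce_all(tf.math.is_finite(b)).numpy())


print(gpu_smoke_test())
```

If the driver is still in a bad state, the matmul is expected to raise an error rather than return, which is a clearer signal than the device list alone.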