使用 Conda 调试损坏的 Tensorflow-gpu 安装(1.14 vs 2.3),Ubuntu 18.04
Debug broken Tensorflow-gpu installation with Conda (1.14 vs 2.3), Ubuntu 18.04
我最近弄错了我的 TF 安装,把所有东西都弄坏了。我曾经有两个 Conda 环境,分别是 TF 1.14 和 2.1,Cuda 10.1,两者都工作正常。经过大量管道工作后,我现在有了带有 TF 2.3、Cuda 10.1 的主要 Conda 环境,但是在完成所有安装库和 tensorrt 并为 TF 1.14 创建新环境(还有一些我还没有移植的旧代码)之后,什么曾经工作得很好,conda install -c (conda-forge|anaconda) tensorflow-gpu
现在看不到我的 gpu。
Sun Nov 1 09:15:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... On | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 6W / N/A | 11MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1469 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2719 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/usr/local/cuda:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.1:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.2:
doc lib64 LICENSE README targets version.txt
/usr/local/cuda-11.1:
include lib64 src targets
最后是错误:
In [2]: tf.test.is_gpu_available()
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
Out[2]: False
(而在我使用 TF 2.3 的其他环境中一切都很好:)
In [2]: tf.config.list_physical_devices()
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Out[2]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
我也知道 的 Conda 分布式版本,直到昨天它还在我的机器上运行,现在我重做了在我看来相同的步骤,但没有任何效果,所以可能是问题...?
有人遇到过这种情况吗?我还需要在另一台机器上解决这个问题,完全相同的问题,并且 /usr/local
中没有 cuda-11.1
这个......提前致谢!
因此,经过多次争论(在当今时代,想要在一台机器上安装两个版本而不是一个 TF 无疑是一种疯狂的表现),我发现可行的解决方案是:
- 在主要的 TF 2.3 环境中,按照 here 中描述的步骤进行操作,除了两个调整:
- 暂时不要安装 TENSORFLOW。
- 目前(2020 年 10 月)
sudo apt-get install --no-install-recommends cuda-10-1
不再有效,但 conda install cudatoolkit=10.1.243
有效,请参阅 this;
- 其他警告 我还注意到 TF 2.3 无法找到整个库数组 (libcublas.so.10, libcufft.so.10, libcurand.so.10 等)直到我安装了 cuda 10.2...
conda install cudatoolkit=10.2.89
,我看到人们谈论它 here,所以不清楚这是完美的解决方案(其他人符号链接文件,或手动将它们从一个目录复制到另一个目录,那些地狱般的日子将被记住;
- (另一个选项,没有TensorRT,但是对于清除cuda和nvidia的东西非常有用,并且fail-safe,可以找到here)
- 安装所有库、cuda 等之后(此时您需要重启,您可以使用
nvidia-smi
检查您的 gpu 是否可见,创建一个新环境,并使用 anaconda 频道安装 TF 1.4(conda-forge 对我来说失败了):conda install tensorflow-gpu=1.14
.
- 最后,在最后,回到主环境,用 pip 安装 tensorflow。
在那里,你应该有这个:
$ conda list | grep tensop tensor
tensorboard 1.14.0 py37hf484d3e_0 anaconda
tensorflow 1.14.0 gpu_py37h74c33d7_0 anaconda
tensorflow-base 1.14.0 gpu_py37he45bfe2_0 anaconda
tensorflow-estimator 1.14.0 py_0 anaconda
tensorflow-gpu 1.14.0 h0d30ee6_0 anaconda
而且,重要的是:
$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
如果你事先用 pip 安装了 TF,这将不起作用。
之后,激活您的其他基础环境,并使用 pip 完成安装
$ pip install tensorflow
哪个应该给你:
$ conda list | grep tenso tensor
tensorboard 2.3.0 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.3.1 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
并且:
$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
我最近弄错了我的 TF 安装,把所有东西都弄坏了。我曾经有两个 Conda 环境,分别是 TF 1.14 和 2.1,Cuda 10.1,两者都工作正常。经过大量管道工作后,我现在有了带有 TF 2.3、Cuda 10.1 的主要 Conda 环境,但是在完成所有安装库和 tensorrt 并为 TF 1.14 创建新环境(还有一些我还没有移植的旧代码)之后,什么曾经工作得很好,conda install -c (conda-forge|anaconda) tensorflow-gpu
现在看不到我的 gpu。
Sun Nov 1 09:15:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... On | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 6W / N/A | 11MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1469 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2719 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/usr/local/cuda:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.1:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.2:
doc lib64 LICENSE README targets version.txt
/usr/local/cuda-11.1:
include lib64 src targets
最后是错误:
In [2]: tf.test.is_gpu_available()
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
Out[2]: False
(而在我使用 TF 2.3 的其他环境中一切都很好:)
In [2]: tf.config.list_physical_devices()
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Out[2]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
我也知道
有人遇到过这种情况吗?我还需要在另一台机器上解决这个问题,完全相同的问题,并且 /usr/local
中没有 cuda-11.1
这个......提前致谢!
因此,经过多次争论(在当今时代,想要在一台机器上安装两个版本而不是一个 TF 无疑是一种疯狂的表现),我发现可行的解决方案是:
- 在主要的 TF 2.3 环境中,按照 here 中描述的步骤进行操作,除了两个调整:
- 暂时不要安装 TENSORFLOW。
- 目前(2020 年 10 月)
sudo apt-get install --no-install-recommends cuda-10-1
不再有效,但conda install cudatoolkit=10.1.243
有效,请参阅 this; - 其他警告 我还注意到 TF 2.3 无法找到整个库数组 (libcublas.so.10, libcufft.so.10, libcurand.so.10 等)直到我安装了 cuda 10.2...
conda install cudatoolkit=10.2.89
,我看到人们谈论它 here,所以不清楚这是完美的解决方案(其他人符号链接文件,或手动将它们从一个目录复制到另一个目录,那些地狱般的日子将被记住; - (另一个选项,没有TensorRT,但是对于清除cuda和nvidia的东西非常有用,并且fail-safe,可以找到here)
- 安装所有库、cuda 等之后(此时您需要重启,您可以使用
nvidia-smi
检查您的 gpu 是否可见,创建一个新环境,并使用 anaconda 频道安装 TF 1.4(conda-forge 对我来说失败了):conda install tensorflow-gpu=1.14
. - 最后,在最后,回到主环境,用 pip 安装 tensorflow。
在那里,你应该有这个:
$ conda list | grep tensop tensor
tensorboard 1.14.0 py37hf484d3e_0 anaconda
tensorflow 1.14.0 gpu_py37h74c33d7_0 anaconda
tensorflow-base 1.14.0 gpu_py37he45bfe2_0 anaconda
tensorflow-estimator 1.14.0 py_0 anaconda
tensorflow-gpu 1.14.0 h0d30ee6_0 anaconda
而且,重要的是:
$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
如果你事先用 pip 安装了 TF,这将不起作用。
之后,激活您的其他基础环境,并使用 pip 完成安装
$ pip install tensorflow
哪个应该给你:
$ conda list | grep tenso tensor
tensorboard 2.3.0 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.3.1 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
并且:
$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0