Debug broken Tensorflow-gpu installation with Conda (1.14 vs 2.3), Ubuntu 18.04

Question

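I recently messed up my TF installation and broke everything. I used to have two Conda environments, with TF 1.14 and TF 2.1 respectively, both on Cuda 10.1 and both working fine. After a lot of plumbing, my main Conda environment now has TF 2.3 with Cuda 10.1. But after installing all the libraries and tensorrt and creating a new environment for TF 1.14 (I still have some old code I have not ported yet), what used to work fine, conda install -c (conda-forge|anaconda) tensorflow-gpu, now gives me a TF that does not see my gpu.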

Sun Nov  1 09:15:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8     6W /  N/A |     11MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2719      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

/usr/local/cuda:
bin  doc  extras  include  lib64  libnsight  libnvvp  LICENSE  nsightee_plugins  nvml  nvvm  README  samples  share  src  targets  tools  version.txt

/usr/local/cuda-10.1:
bin  doc  extras  include  lib64  libnsight  libnvvp  LICENSE  nsightee_plugins  nvml  nvvm  README  samples  share  src  targets  tools  version.txt

/usr/local/cuda-10.2:
doc  lib64  LICENSE  README  targets  version.txt

/usr/local/cuda-11.1:
include  lib64  src  targets
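A note on reading the two outputs above: the "CUDA Version" in the nvidia-smi header is the highest CUDA version the driver supports, while nvcc reports the toolkit that is actually installed, so 11.0 next to 10.1 is not a conflict in itself. As a minimal sketch (the helper names are hypothetical, and the second output is assumed to come from nvcc --version), the two values can be pulled out of the pasted text like this:

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Return the 'CUDA Version' shown in the nvidia-smi header, or None."""
    match = re.search(r"CUDA Version:\s*([0-9]+\.[0-9]+)", nvidia_smi_text)
    return match.group(1) if match else None


def toolkit_cuda_version(nvcc_text):
    """Return the toolkit release reported by nvcc, or None."""
    match = re.search(r"release\s+([0-9]+\.[0-9]+)", nvcc_text)
    return match.group(1) if match else None


# Sample lines copied from the outputs in this question.
smi_header = "| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |"
nvcc_line = "Cuda compilation tools, release 10.1, V10.1.243"

print(driver_cuda_version(smi_header))   # 11.0
print(toolkit_cuda_version(nvcc_line))   # 10.1
```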

And finally, the error:

In [2]: tf.test.is_gpu_available()                                                                                                                                                     
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz                                                                     
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:                             
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>                                                    
Out[2]: False

(whereas in my other environment, with TF 2.3, everything is fine:)

In [2]: tf.config.list_physical_devices()                                                                                                                                              
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1                                           
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:                                                                   
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5                                                                                              
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s                                                                                         
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1                                      
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10                                        
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10                                         
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10                                        
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10                                      
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10                                      
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7                                          
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0                                                                     
Out[2]:                                                                                                                                                                                
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),                                                                                                                     
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),                                                                                                             
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),                                                                                                             
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

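I'm also aware that the Conda-distributed version could be the issue: until yesterday it was working on my machine, and now I have redone what seem to me to be the same steps, but nothing works...?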

Has anyone run into this? I also need to fix this on another machine that has exactly the same problem, and that one has no cuda-11.1 in /usr/local... Thanks in advance!

Answer 1

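So, after a lot of wrangling (wanting two versions of TF on one machine instead of one is surely a sign of madness in this day and age), the solution I found to work is: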

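In the main TF 2.3 environment, follow the steps described here, with two adjustments:
- DO NOT INSTALL TENSORFLOW yet.
- Currently (October 2020), sudo apt-get install --no-install-recommends cuda-10-1 no longer works, but conda install cudatoolkit=10.1.243 does, see this;
- Another caveat: I also noticed that TF 2.3 could not find a whole array of libraries (libcublas.so.10, libcufft.so.10, libcurand.so.10, etc.) until I installed cuda 10.2 with conda install cudatoolkit=10.2.89. I see people talking about it here, so it is not clear that this is the perfect solution (others symlink the files, or copy them manually from one directory to another; those hellish days shall be remembered);
- (Another option, without TensorRT, but very useful for purging the cuda and nvidia stuff, and fail-safe, can be found here.)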
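To see which of those libraries are actually present before reaching for symlinks, a small check can help. The sketch below only looks for files on disk; it does not reproduce how the dynamic loader resolves libraries, and the search directories (the active Conda env's lib folder and /usr/local/cuda/lib64) are assumptions. The library names are the ones from the TF 2.3 log in the question; missing_cuda_libs is a hypothetical helper name.

```python
import glob
import os

# Library names taken from the TF 2.3 log in the question.
REQUIRED_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def missing_cuda_libs(search_dirs, required=REQUIRED_LIBS):
    """Return the required library names not found in any of search_dirs."""
    missing = []
    for name in required:
        # Accept the exact soname or a longer version, e.g. libcublas.so.10.2.1.243
        found = any(
            glob.glob(os.path.join(d, name)) or glob.glob(os.path.join(d, name + ".*"))
            for d in search_dirs
        )
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed locations; adjust to wherever your toolkit actually lives.
    dirs = ["/usr/local/cuda/lib64"]
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        dirs.insert(0, os.path.join(prefix, "lib"))
    print("missing:", missing_cuda_libs(dirs))
```

An empty list here does not guarantee that TF will load the libraries, but a non-empty one points at what is missing.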
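After installing all the libs, cuda, etc. (you need to reboot at this point, and you can check that your gpu is visible with nvidia-smi), create a new environment and install TF 1.14 using the anaconda channel (conda-forge failed for me): conda install tensorflow-gpu=1.14.
Finally, at the very end, go back to the main environment and install tensorflow with pip.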

In the new environment, you should have this:

$ conda list | grep tensor
tensorboard               1.14.0           py37hf484d3e_0    anaconda
tensorflow                1.14.0          gpu_py37h74c33d7_0    anaconda
tensorflow-base           1.14.0          gpu_py37he45bfe2_0    anaconda
tensorflow-estimator      1.14.0                     py_0    anaconda
tensorflow-gpu            1.14.0               h0d30ee6_0    anaconda

And, importantly:

$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0

This will not work if you installed TF with pip beforehand.
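One way to tell which installer owns TF in an environment is the last column of conda list: pip-installed packages show up with the channel pypi. A minimal sketch, assuming the four-column layout (name, version, build, channel) seen in this thread; the function names are hypothetical, and rows without a channel column are simply skipped:

```python
def tensorflow_sources(conda_list_text):
    """Map each tensorflow* package name to the channel column of conda list."""
    sources = {}
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Expected columns: name, version, build, channel
        if len(fields) >= 4 and fields[0].startswith("tensorflow"):
            sources[fields[0]] = fields[3]
    return sources


def installed_by_pip(conda_list_text):
    """Return the tensorflow* packages that conda reports as coming from PyPI."""
    sources = tensorflow_sources(conda_list_text)
    return sorted(name for name, channel in sources.items() if channel == "pypi")


# Sample rows in the format printed by conda list in this thread.
conda_env_listing = """\
tensorflow                1.14.0          gpu_py37h74c33d7_0    anaconda
tensorflow-gpu            1.14.0               h0d30ee6_0    anaconda
"""
pip_env_listing = """\
tensorflow                2.3.1                    pypi_0    pypi
tensorflow-estimator      2.3.0                    pypi_0    pypi
"""

print(installed_by_pip(conda_env_listing))  # []
print(installed_by_pip(pip_env_listing))    # ['tensorflow', 'tensorflow-estimator']
```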

After that, activate your other (base) environment and finish the installation with pip:

$ pip install tensorflow

Which should give you:

$ conda list | grep tensor
tensorboard               2.3.0                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tensorflow                2.3.1                    pypi_0    pypi
tensorflow-estimator      2.3.0                    pypi_0    pypi

And:

$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
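
With both environments in place, the GPU check from the question can be repeated in each of them. The sketch below assumes only the two calls already used in this thread: tf.config.list_physical_devices (used with TF 2.3) and tf.test.is_gpu_available (used with TF 1.14). As far as I know the former is not exposed under that name in 1.14, hence the fallback. gpu_visible is a hypothetical helper name.

```python
def gpu_visible(tf):
    """Return True if the given TensorFlow module reports at least one GPU.

    Uses tf.config.list_physical_devices when present, and falls back to
    tf.test.is_gpu_available, the call used with TF 1.14 in the question.
    """
    config = getattr(tf, "config", None)
    list_devices = getattr(config, "list_physical_devices", None)
    if list_devices is not None:
        return len(list_devices("GPU")) > 0
    return bool(tf.test.is_gpu_available())


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        print(tf.__version__, "GPU visible:", gpu_visible(tf))
```

Run it once per environment, after conda activate; if it reports False in the 1.14 environment, the install order above is the first thing I would recheck.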

conda

tensorflow