Setup Tensorflow 2.4 on Ubuntu 20.04 with GPU without sudo

Question

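I have access to a virtual machine with Ubuntu 20.04 set up and GPUs. The sysadmin has already installed the latest CUDA driver, but unfortunately that is not enough to use the GPU in Tensorflow, because each version of TF can be very picky about the particular set of CUDA Toolkit + cuDNN versions it works with. I don't have sudo rights, so I need to install everything locally.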

nvidia-smi

returns Driver Version: 465.19.01, CUDA Version: 11.3

python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU');"

returns

2021-05-11 10:56:26.737279: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:26.737338: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-11 10:56:28.313896: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 10:56:28.324232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.324707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.324867: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.325293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:06.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325563: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325706: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326028: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326117: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326230: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

This indicates that the GPU will not be used in TF applications.
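To see at a glance which libraries TF is missing, the warnings can be collected from the log. The following is a minimal sketch, not part of the original post: the function name `missing_libraries` is hypothetical, and the regular expression assumes the `dso_loader` warning format shown in the log above.

```python
import re

# Matches the dso_loader warning format seen in the log above (an assumption
# about the message layout; other TF versions may word it differently).
_MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names TF failed to load, de-duplicated, in first-seen order."""
    seen = []
    for name in _MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Four lines copied from the log in the question.
sample = """\
2021-05-11 10:56:26.737279: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
"""

print(missing_libraries(sample))  # ['libcudart.so.11.0', 'libcudnn.so.8']
```

The version suffixes in these names (11.0, 11, 10, 8) are what point to a CUDA 11.0 toolkit plus cuDNN 8 as the set to install.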

I had to spend some time setting up the virtual machine, so I will post my solution step by step below.

Answer 1

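Instructions for setting up Tensorflow 2.4.x (tested for 2.4.1) in an Ubuntu 20.04 environment without admin rights. It is assumed that the sysadmin has already installed the latest CUDA drivers. The setup consists of installing the CUDA 11.0 toolkit + cuDNN 8.2.0.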

The instructions below install CUDA 11.0 (tested to work with Tensorflow 2.4.1), without sudo rights, under the directory /home/pherath/cuda_toolkits/cuda-11.0.

Step 1. Download CUDA 11.0

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
chmod +x cuda_11.0.2_450.51.05_linux.run

Step 2, Option 1: For a quick, automated install, use the following

./cuda_11.0.2_450.51.05_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0
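The same flags are reused for the patch installer in Step 4, so it can help to assemble the command in one place. This is only a sketch: the helper name `installer_command` is hypothetical, and the flags are exactly the ones used in the command above.

```python
import shlex


def installer_command(runfile, toolkit_path, tmpdir="."):
    """Assemble the silent, toolkit-only invocation used in this answer."""
    return [
        runfile,
        "--silent",                       # no interactive prompts
        f"--tmpdir={tmpdir}",             # unpack into a directory we can write to
        "--toolkit",                      # install the toolkit only, not the driver
        f"--toolkitpath={toolkit_path}",  # user-writable install location
    ]


cmd = installer_command(
    "./cuda_11.0.2_450.51.05_linux.run",
    "/home/pherath/cuda_toolkits/cuda-11.0",
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` if you prefer to drive the installation from a script.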

Step 2, Option 2: Here is a visual step-by-step guide

./cuda_11.0.2_450.51.05_linux.run

Continue, then accept the EULA.

Leave only Cuda Toolkit checked, uncheck everything else. Then go to Options.

Go into Toolkit Options.

Uncheck everything, then go to Change Toolkit Install Path and replace it with /home/pherath/cuda_toolkits/cuda-11.0. After this step, proceed with Install.

Step 3. Download the CUDA 11.0 patch

wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
chmod +x cuda_11.0.3_450.51.06_linux.run

Step 4, Option 1: Quick silent mode

./cuda_11.0.3_450.51.06_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

Step 4, Option 2: GUI mode. Repeat the exact steps of Step 2, Option 2.

The installation might exit with an error. When checking the logs, the error I saw suggests that there might be a bug in the installation script. The only offending item is the symbolic link of one file.

[ERROR]: boost::filesystem::create_symlink: File exists: "libcuinj64.so.11.0", "/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib/libcuinj64.so"

I got a few other single errors of this kind when trying various distributions (e.g., on Ubuntu 16.04):
libcuinj64.so.11.0, libaccinj64.so.11.0, libnvrtc-builtins.so.11.0

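This error can be fixed with the following two lines (shown here for libaccinj64; the same pattern applies to whichever file your log names)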

cd /home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib # move to the dir of the offending line
ln -s libaccinj64.so.11.0 libaccinj64.so #reorder such that symbolic link and target are in correct order (we need libaccinj64.so -> libaccinj64.so.11.0)
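If the link already exists, a plain `ln -s` stops with a "File exists" error, so a script that checks first may be more convenient. The sketch below is one way to do that. The function name `ensure_symlink` is hypothetical and the demo runs in a temporary directory rather than the real toolkit path.

```python
import os
import tempfile
from pathlib import Path


def ensure_symlink(lib_dir, link_name, target_name):
    """Make lib_dir/link_name a relative symlink to target_name.

    Returns True if the link was created or replaced, False if it was already correct.
    """
    lib_dir = Path(lib_dir)
    link = lib_dir / link_name
    if not (lib_dir / target_name).exists():
        raise FileNotFoundError(f"{target_name} not found in {lib_dir}")
    if link.is_symlink():
        if os.readlink(link) == target_name:
            return False  # already points at the right file
        link.unlink()     # wrong target: remove the link and recreate it below
    elif link.exists():
        raise FileExistsError(f"{link} is a regular file; refusing to overwrite it")
    link.symlink_to(target_name)
    return True


# Demo in a throwaway directory with an empty stand-in for the real library.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "libaccinj64.so.11.0").touch()
    print(ensure_symlink(demo_dir, "libaccinj64.so", "libaccinj64.so.11.0"))  # True
    print(ensure_symlink(demo_dir, "libaccinj64.so", "libaccinj64.so.11.0"))  # False
```

For the real installation, `lib_dir` would be /home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib, with the library names taken from the error in your own log.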

Step 5. Download cuDNN 8.2.0

cd /home/pherath/cuda_toolkits # move back to the parent of previous dir

You need to download the cuDNN .tgz file from the cuDNN archives; I used v8.2.0. This step requires you to create an NVIDIA developer account and download through the web interface. If the machine you are setting up Tensorflow on has no GUI, I suggest using the "Link Redirect Trace" add-on (available for Google Chrome) to find the exact link the file would be downloaded from. You can trace the link on a local computer that has a GUI, then use wget on the VM to download from the traced link. Note that the traced link has a relatively short lifetime.

The downloaded file will still have a scrambled name; rename it back to .tgz via

mv $some_ambiguous_name cudnn-11.3-linux-x64-v8.2.0.53.tgz

tar -xvzf cudnn-11.3-linux-x64-v8.2.0.53.tgz # this will extract things under a dir called 'cuda'
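If the extraction fails, one possible cause is that the traced link had already expired and wget saved something other than the archive. A quick check, sketched here with Python's standard `tarfile` module, tells the two cases apart. The function name `looks_like_tgz` is hypothetical, and the "expired link" content in the demo is made up purely for illustration.

```python
import os
import tarfile
import tempfile


def looks_like_tgz(path):
    """Return True if path can be opened as a gzip-compressed tar archive."""
    try:
        with tarfile.open(path, "r:gz"):
            return True
    except (tarfile.TarError, OSError):
        # Not gzip, not a tar archive, or the file is missing or unreadable.
        return False


# Demo: a text file that merely carries a .tgz name is rejected.
with tempfile.TemporaryDirectory() as demo_dir:
    fake = os.path.join(demo_dir, "download.tgz")
    with open(fake, "w") as handle:
        handle.write("<html>link expired</html>")  # illustrative content only
    print(looks_like_tgz(fake))  # False
```

Running `looks_like_tgz("cudnn-11.3-linux-x64-v8.2.0.53.tgz")` in the download directory should return True for a complete download.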

Now we need to copy everything under lib64 and include into the corresponding directories of the CUDA toolkit installation

cp -fv cuda/lib64/*.* cuda-11.0/lib64/.
cp -fv cuda/include/*.* cuda-11.0/include/.
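To confirm that the intended cuDNN headers landed in the toolkit directory, the version macros can be read back. This sketch assumes the cuDNN 8.x layout, where `cudnn_version.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`. The function name `cudnn_version` is hypothetical and the sample text is an illustrative excerpt, not the full header.

```python
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{macro}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{macro} not found")
        parts.append(int(match.group(1)))
    return tuple(parts)


# Illustrative excerpt, assumed to mirror the macro lines of the real header.
sample_header = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 0
"""

print(cudnn_version(sample_header))  # (8, 2, 0)
```

For the real file, pass in the contents of /home/pherath/cuda_toolkits/cuda-11.0/include/cudnn_version.h, for example via `pathlib.Path(...).read_text()`.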

Step 6. Create/append/prepend the PATH and LD_LIBRARY_PATH environment variables.

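Add the following lines to the end of your ~/.bashrc (otherwise, make sure to extend the respective environment variables in every bash session you will run TF scripts from).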

export CUDA11=/home/pherath/cuda_toolkits/cuda-11.0
export PATH=$CUDA11/bin:$PATH
export LD_LIBRARY_PATH=$CUDA11/lib64:$CUDA11/extras/CUPTI/lib64:$LD_LIBRARY_PATH

Start a new terminal, or run

source ~/.bashrc

in each existing terminal.
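A quick way to confirm that the current shell picked up the new settings is to check LD_LIBRARY_PATH for the two directories exported above. This is a minimal sketch and the function name `missing_entries` is hypothetical.

```python
import os


def missing_entries(path_value, required_dirs):
    """Return the required directories absent from a ':'-separated path value."""
    entries = {os.path.normpath(p) for p in path_value.split(":") if p}
    return [d for d in required_dirs if os.path.normpath(d) not in entries]


CUDA11 = "/home/pherath/cuda_toolkits/cuda-11.0"
required = [f"{CUDA11}/lib64", f"{CUDA11}/extras/CUPTI/lib64"]

# An empty list means both directories are present in this session.
print(missing_entries(os.environ.get("LD_LIBRARY_PATH", ""), required))
```

Because the dynamic loader reads LD_LIBRARY_PATH when a process starts, the check has to be run from a terminal that already has the updated variables.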

Check that the installation was successful

You can run the following lines to test whether TF 2.4.1 + profiler works:

conda create -n tf python=3.7 -y  # create a Python 3.7 environment
conda activate tf  # activate the virtual environment (here conda)
pip install tensorflow==2.4.1 # install tf 2.4.1
python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU'); tf.profiler.experimental.start('.'); tf.profiler.experimental.stop()" # test to see if TF with GPU works
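If TF still skips the GPU, it can help to test the shared libraries directly, without TF in the way. The sketch below tries to load each library named in the warnings from the question through `ctypes`. The function name `unloadable` is hypothetical, and the library list is copied from that log, so it applies to TF 2.4.x only.

```python
import ctypes

# Library names taken from the dso_loader warnings in the question (TF 2.4.1).
REQUIRED_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def unloadable(libs, loader=ctypes.CDLL):
    """Return the libraries that the dynamic loader cannot open."""
    failed = []
    for name in libs:
        try:
            loader(name)
        except OSError:
            failed.append(name)
    return failed


# An empty list means every library was found on the loader's search path.
print(unloadable(REQUIRED_LIBS))
```

Any name that remains in the output is one the loader cannot find, which usually points back to Step 5 (cuDNN files not copied) or Step 6 (LD_LIBRARY_PATH not set in this terminal).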

############################################# ############################

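If you want to install CUDA Toolkit 10.2 on Ubuntu 20.04 LTS, the one-line install command changes accordingly (you need to add a library path and override the complaint about the gcc version mismatch).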

./cuda_10.2.89_440.33.01_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-10.2 --librarypath=/home/pherath/cuda_toolkits/cuda-10.2 --override

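Keep in mind that you also need to repeat this process for the patches of CUDA toolkit 10.2. After that, you need to download the respective cuDNN and copy lib64 and include into the CUDA toolkit's directories (same as in the instructions above).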

############################################# ############################

If you still get errors, it is very likely that you don't have the correct cuda/nvidia driver installed. To fix this, you will need sudo rights!

1. First, purge all cuda/nvidia content (I can't add the reference due to limited reputation); basically, run the line below with sudo rights.

apt clean; apt update; apt purge cuda; apt purge nvidia-*; apt autoremove; apt install cuda

2. Follow Google's instructions at https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#ubuntu-driver-steps

3. Restart the machine.


