How to automatically select an idle GPU for model training in TensorFlow?

Question

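I am using the NVIDIA prebuilt Docker container NVIDIA Release 20.12-tf2 to run my experiments, with TensorFlow version 2.3.1. One of my GPUs is currently busy training a model, and the other 3 GPUs are idle, so I would like to run my other experiments on any of the idle GPUs. Here is the output of nvidia-smi: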

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:6A:00.0 Off |                    0 |
| N/A   70C    P0    71W /  70W |  14586MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:6B:00.0 Off |                    0 |
| N/A   39C    P0    27W /  70W |    212MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:6C:00.0 Off |                    0 |
| N/A   41C    P0    28W /  70W |    212MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:6D:00.0 Off |                    0 |
| N/A   41C    P0    28W /  70W |    212MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Update: prebuilt container

I start the NVIDIA prebuilt container as follows:

docker run -ti --rm --gpus all --shm-size=1024m -v /home/hamilton/data:/data nvcr.io/nvidia/tensorflow:20.12-tf2-py3

To use the idle GPUs for my other experiments, I tried adding the following to my Python script:

Attempt 1

import tensorflow as tf

devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

But this attempt gave me the following error:

    raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices

I googled this error, but none of the fixes discussed on GitHub worked for me.

Attempt 2

I also tried this:

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
  tf.config.experimental.set_memory_growth(gpu, True)

But this attempt also gave me an error:

Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.

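People have discussed this error on GitHub, but I still could not get rid of it on my side.

Latest attempt

I also tried parallel training with TensorFlow by adding this to my Python script: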

device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(device_type)
devices_names = [d.name.split("e:")[1] for d in devices]
strategy = tf.distribute.MirroredStrategy(devices=devices_names[:3])

with strategy.scope():
    opt = Adam(learning_rate=0.1)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

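But this also gave me an error, and the program stopped.

Can anyone help me automatically select idle GPUs for training a model in TensorFlow? Does anyone know a workable approach? What is wrong with my attempts above? Is there a way to make use of the idle GPUs while a program is already running on one of the GPUs?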

Answer 1

Thanks to @HernánAlarcón's suggestion, I tried the following and it worked well:

docker run -ti --rm --gpus device=1,3 --shm-size=1024m -v /home/hamilton/data:/data nvcr.io/nvidia/tensorflow:20.12-tf2-py3

This may not be an elegant solution, but it works like a charm. I am open to other possible remedies for this kind of problem.
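To pick idle GPUs from inside the script rather than on the docker command line, one possible approach is to ask nvidia-smi which GPUs are idle and set CUDA_VISIBLE_DEVICES before TensorFlow is imported. The following is a minimal sketch, not something verified in this container: the helper names (parse_idle_gpus, query_idle_gpus, select_idle_gpus) and the idleness thresholds (at most 500 MiB used, at most 5% utilization) are assumptions made for illustration.

```python
import os
import subprocess


def parse_idle_gpus(csv_text, max_used_mib=500, max_util_pct=5):
    """Return indices of GPUs at or below both thresholds (thresholds are assumed values).

    csv_text is expected to be the output of:
      nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits
    """
    idle = []
    for line in csv_text.strip().splitlines():
        try:
            index, mem_used, util = (int(field.strip()) for field in line.split(","))
        except ValueError:
            # Skip lines that are malformed or report non-numeric values such as "[N/A]".
            continue
        if mem_used <= max_used_mib and util <= max_util_pct:
            idle.append(index)
    return idle


def query_idle_gpus(**thresholds):
    """Run nvidia-smi and return the indices of GPUs that look idle."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_idle_gpus(output, **thresholds)


def select_idle_gpus(count=1, **thresholds):
    """Expose only `count` idle GPUs to this process via CUDA_VISIBLE_DEVICES."""
    idle = query_idle_gpus(**thresholds)
    if len(idle) < count:
        raise RuntimeError(f"wanted {count} idle GPU(s), found {len(idle)}")
    chosen = idle[:count]
    # Make CUDA number devices in the same PCI bus order that nvidia-smi reports.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in chosen)
    return chosen
```

select_idle_gpus() would have to be called at the very top of the training script, before `import tensorflow as tf`, because the environment variable is only read when CUDA is initialized. With the nvidia-smi output shown in the question, the parser would treat GPUs 1, 2 and 3 as idle. Note that another process could claim a GPU between the query and the start of training, so this is a convenience rather than a replacement for a proper job scheduler.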

python

gpu

tensorflow

nvidia-docker