Tensorflow on shared GPUs: how to automatically select the one that is unused

Question

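I have ssh access to a cluster of n GPUs. Tensorflow automatically names them gpu:0,...,gpu:(n-1).

Other people have access too, and sometimes they take GPUs at random. I have not placed any tf.device() explicitly, because that is cumbersome, and it would still be a problem if I selected GPU number j and someone else was already using GPU number j.

I would like to go through the GPU usage, find the first GPU that is unused, and use only that one. I suppose one could parse the output of nvidia-smi with bash, get a variable i, and pass that variable i to the tensorflow script as the number of the GPU to use.

I have never seen an example of this, although I imagine it is a pretty common problem. What would be the simplest way to do it? Is a pure tensorflow solution available?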

Answer 1

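I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. For GPU memory, however, one GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add this, and there is no mechanism for process-global configuration (there should be one, so that the process-global Eigen thread pool could be configured as well). So you need to do it at the process level, using the CUDA_VISIBLE_DEVICES environment variable.

Something like this: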

import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""

    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python                                       11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    rows = gpu_output.split("\n")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in rows:
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""

    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu

You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.:

import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
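
The table parsing above depends on the exact layout printed by one nvidia-smi version. As a possible alternative, here is a minimal sketch that assumes your nvidia-smi supports the machine-readable `--query-gpu` / `--format=csv` interface. The helper names (`parse_memory_csv`, `query_gpu_memory`, `pick_least_used`) are made up for illustration. Note that it ranks GPUs by the total used memory reported per GPU rather than by summing per-process rows, so it is a variation on the approach above, not the same code.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'index, memory.used' CSV rows into {gpu_id: used_MiB}."""
    usage = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU used memory in MiB (CSV, no header, no units).

    Assumption: the installed nvidia-smi accepts --query-gpu and reports a
    plain number for memory.used on every GPU.
    """
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]
    )
    return parse_memory_csv(output.decode("ascii"))


def pick_least_used(usage):
    """Return the GPU id with the least used memory; ties go to the lowest id."""
    return min(usage, key=lambda gpu_id: (usage[gpu_id], gpu_id))


# Usage, before the first tensorflow import (needs `import os`):
#   os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_least_used(query_gpu_memory()))
```

Keeping the parsing separate from the nvidia-smi call makes it possible to check the selection logic on a machine without a GPU. As with the original code, another user can still grab the chosen GPU between the query and the moment TensorFlow allocates memory.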

Answer 2

An implementation of Yaroslav Bulatov's solution is available at https://github.com/bamos/setGPU.

