Use multiple GPUs for inception_v3 model in TF slim

Question

I am trying to train a slim model using 3 GPUs.

I am explicitly telling TF to use the second GPU to allocate the model:

with tf.device('device:GPU:1'):
    logits, end_points = inception_v3(inputs)

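However, every time I run my code I get an OOM error on that GPU. I tried reducing batch_size so that the model fits in memory, but that ruins the network.

I have 3 GPUs, so is there a way to tell TF to use my third GPU when the second one is full? I also tried not pinning the model to any GPU and allowing soft placement, but that did not work either.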

Answer 1

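The statement with tf.device('device:GPU:1') explicitly tells TensorFlow to use GPU-1, so it will not try to use any of the other devices you have.

When a model is too large, the recommended way is to use model parallelism by manually splitting the graph across different GPUs. The complication in your case is that the model definition lives in a library, so you cannot insert tf.device statements for different layers unless you patch TensorFlow.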
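For a model whose layers you define yourself, the manual split could look roughly like the sketch below. It is only an illustration: the two tiny layers, their names (w1, w2) and their sizes are made up, and it assumes the TF1-style graph API is reachable through tensorflow.compat.v1. allow_soft_placement is turned on so the sketch still runs on a machine with fewer GPUs than it names.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, [None, 8])

    # First half of the (made-up) model is pinned to GPU 0.
    with tf.device('/device:GPU:0'):
        w1 = tf.get_variable('w1', shape=[8, 16])
        hidden = tf.nn.relu(tf.matmul(x, w1))

    # Second half is pinned to GPU 1; TF copies `hidden` between devices.
    with tf.device('/device:GPU:1'):
        w2 = tf.get_variable('w2', shape=[16, 4])
        y = tf.matmul(hidden, w2)

    init = tf.global_variables_initializer()

# Soft placement lets TF fall back to an available device if a requested
# GPU does not exist, instead of raising an error.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(init)
    out = sess.run(y, feed_dict={x: np.zeros((2, 8), dtype=np.float32)})

print(w1.device, w2.device, out.shape)
```

Each tf.device block only affects the ops and variables created inside it, which is why this approach needs access to the layer-by-layer model code.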

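But there is a workaround

You can define and place the variables before calling the inception_v3 builder. That way inception_v3 will reuse those variables without changing their placement. Example: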

with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
    # Pre-create the variables of the last layer on GPU 1.
    with tf.device('device:GPU:1'):
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/biases", shape=[1000])
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/weights", shape=[1, 1, 2048, 1000])

    # Build the model on GPU 0; the variables created above are reused as placed.
    with tf.device('device:GPU:0'):
        logits, end_points = inception_v3(inputs)

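When you run it, you will see that all variables except those of Conv2d_1c_1x1 are placed on GPU-0, while the Conv2d_1c_1x1 layer is on GPU-1.

The downside is that you need to know the shape of each variable you want to replace. But it is doable, and it at least gets your model running.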
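Once the shapes are known, they can also help decide which variables to move. The helper below is a hypothetical, plain-Python sketch (variable_bytes and plan_placement are not part of TensorFlow or slim): it estimates float32 variable sizes from their shapes and assigns variables to devices in order, moving to the next device when an assumed per-device budget is exceeded. It counts variables only; activations and gradients also take GPU memory, so treat the result as a rough guide.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4


def variable_bytes(shape):
    """Size in bytes of a float32 variable with the given shape."""
    return reduce(mul, shape, 1) * BYTES_PER_FLOAT32


def plan_placement(shapes, devices, budget_bytes):
    """Greedily map variable names to devices under a per-device budget.

    shapes: dict of variable name -> shape (iterated in insertion order)
    devices: device strings, filled one after another
    budget_bytes: assumed budget for variables on each device
    """
    plan = {}
    index, used = 0, 0
    for name, shape in shapes.items():
        size = variable_bytes(shape)
        # Move on to the next device while the current one cannot hold it.
        while index < len(devices) and used + size > budget_bytes:
            index += 1
            used = 0
        if index == len(devices):
            raise MemoryError("no device left for " + name)
        plan[name] = devices[index]
        used += size
    return plan


# Shapes taken from the example above; the budget is an arbitrary number.
shapes = {
    "InceptionV3/Logits/Conv2d_1c_1x1/weights": [1, 1, 2048, 1000],
    "InceptionV3/Logits/Conv2d_1c_1x1/biases": [1000],
}
plan = plan_placement(shapes, ["device:GPU:1", "device:GPU:2"], 8192000)
for name, device in plan.items():
    print(device, name)
```

The resulting plan could then drive the tf.device blocks used when pre-creating the variables.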

Tags: python, distributed-computing, multi-gpu, deep-learning, tensorflow