How to use feed_dict in Tensorflow multiple GPU case
Recently I have been trying to learn how to use TensorFlow on multiple GPUs to speed up training. I found an official tutorial about training a classification model on the CIFAR-10 dataset. However, I noticed that the tutorial reads the images with a queue. Out of curiosity, how can I use multiple GPUs by feeding values into a Session? It seems hard for me to feed different values from the same dataset to different GPUs. Thanks, everyone! The following code is part of the official tutorial.
images, labels = cifar10.distorted_inputs()
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [images, labels], capacity=2 * FLAGS.num_gpus)
# Calculate the gradients for each model tower.
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                # Dequeues one batch for the GPU
                image_batch, label_batch = batch_queue.dequeue()
                # Calculate the loss for one tower of the CIFAR model. This function
                # constructs the entire CIFAR model but shares the variables across
                # all towers.
                loss = tower_loss(scope, image_batch, label_batch)
                # Reuse variables for the next tower.
                tf.get_variable_scope().reuse_variables()
                # Retain the summaries from the final tower.
                summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                # Calculate the gradients for the batch of data on this CIFAR tower.
                grads = opt.compute_gradients(loss)
                # Keep track of the gradients across all towers.
                tower_grads.append(grads)
The QueueRunner and the queue-based API are relatively outdated; this is clearly stated in the TensorFlow docs:
Input pipelines using the queue-based APIs can be cleanly replaced by the tf.data API
It is therefore recommended to use the tf.data API. It is optimized for multi-GPU and TPU use.
How to use it?
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
iterator = dataset.make_one_shot_iterator()
x, y = iterator.get_next()

# define your model
logits = tf.layers.dense(x, 2)  # use x directly in your model
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
train_step = tf.train.AdamOptimizer().minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step)
You can use Dataset.shard(), or, more easily, the Estimator API, to create one iterator per GPU; a sketch of the sharding approach is shown below.
For a complete tutorial, see here.
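A minimal sketch of the Dataset.shard() idea, assuming two GPUs and small in-memory arrays; the dummy data, the simple dense model, and names such as num_gpus and tower_losses are illustrative assumptions, not part of the original tutorial:

import numpy as np
import tensorflow as tf

num_gpus = 2
x_train = np.random.rand(128, 4).astype(np.float32)                 # dummy features
y_train = np.random.randint(0, 2, size=(128, 2)).astype(np.float32)  # dummy one-hot labels

tower_losses = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        # Each tower reads every num_gpus-th example of the same dataset.
        shard = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shard(num_gpus, i)
                 .batch(16)
                 .repeat())
        x, y = shard.make_one_shot_iterator().get_next()
        with tf.variable_scope('model', reuse=(i > 0)):  # share weights across towers
            logits = tf.layers.dense(x, 2)
        tower_losses.append(tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y)))

train_step = tf.train.AdamOptimizer().minimize(tf.add_n(tower_losses) / num_gpus)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step)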
The core idea of the multi-GPU example is that you explicitly assign operations to a tf.device. The example loops over FLAGS.num_gpus devices and creates a replica for each GPU.
If you create the placeholder ops inside the for loop, they will be assigned to their respective devices. All you need to do is keep handles to the created placeholders and then feed all of them independently in a single session.run call.
placeholders = []
for i in range(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        plc = tf.placeholder(tf.int32)
        placeholders.append(plc)

with tf.Session() as sess:
    fd = {plc: i for i, plc in enumerate(placeholders)}
    sess.run(sum(placeholders), feed_dict=fd)  # this should give you the sum of all
                                               # numbers from 0 to FLAGS.num_gpus - 1
To address your specific example, it is enough to replace the batch_queue.dequeue() call with the construction of two placeholders (for the image_batch and label_batch tensors), store those placeholders somewhere, and then feed the values you need to them; a sketch is shown below.
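A minimal sketch of that replacement inside the tower loop, assuming the 24x24x3 image size used by the CIFAR-10 tutorial; the lists image_placeholders/label_placeholders and the per-GPU numpy batches my_images/my_labels and train_op are hypothetical names, not part of the tutorial:

image_placeholders, label_placeholders = [], []
with tf.variable_scope(tf.get_variable_scope()):
    for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                # Placeholders instead of batch_queue.dequeue()
                image_batch = tf.placeholder(tf.float32, [None, 24, 24, 3])
                label_batch = tf.placeholder(tf.int32, [None])
                image_placeholders.append(image_batch)
                label_placeholders.append(label_batch)
                loss = tower_loss(scope, image_batch, label_batch)
                tf.get_variable_scope().reuse_variables()
                grads = opt.compute_gradients(loss)
                tower_grads.append(grads)

# Later, feed a different slice of the dataset to each tower in one run call:
feed = {}
for i in xrange(FLAGS.num_gpus):
    feed[image_placeholders[i]] = my_images[i]  # hypothetical per-GPU numpy batch
    feed[label_placeholders[i]] = my_labels[i]
sess.run(train_op, feed_dict=feed)  # train_op applies the averaged tower gradients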
Another (somewhat hacky) approach is to override the image_batch and label_batch tensors directly in the session.run call, since you can feed_dict any tensor (not just a placeholder). You still need to store those tensors somewhere so that you can reference them from the run call.
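A minimal sketch of that hacky variant, assuming you kept the dequeued tensors from each tower in a list; batch_tensors, my_images, my_labels, and train_op are illustrative names, not part of the tutorial:

# While building the towers, keep handles to the dequeued tensors:
#   image_batch, label_batch = batch_queue.dequeue()
#   batch_tensors.append((image_batch, label_batch))

feed = {}
for i, (image_t, label_t) in enumerate(batch_tensors):
    feed[image_t] = my_images[i]  # overrides the value that would come from the queue
    feed[label_t] = my_labels[i]
sess.run(train_op, feed_dict=feed)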