TensorFlow multi-GPU example error: Variable conv1/weights/ExponentialMovingAverage/ does not exist

Question

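I am running the code mentioned in this tutorial: https://www.tensorflow.org/tutorials/deep_cnn/

I downloaded the code from here: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/

I am running it on an AWS g2.4xlarge machine with Ubuntu 14.04. The single-GPU example runs fine, without any errors, but the multi-GPU example fails with the error shown below.

Can someone help me solve this? I am running version 0.12: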

ubuntu@ip-xxx-xx-xx-xx:~/pythonworkspace/tensorflowdev/models-master/tutorials/image/cifar10$ python -c 'import tensorflow as tf; print(tf.__version__)'

0.12.head

ubuntu@ip-xxx-xx-xx-xx:~/pythonworkspace/tensorflowdev/models-master/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py --num_gpus=2

>> Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/ubuntu/pythonworkspace/tensorflowdev/models-master/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/ubuntu/pythonworkspace/tensorflowdev/models-master/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Traceback (most recent call last):
  File "cifar10_multi_gpu_train.py", line 273, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "cifar10_multi_gpu_train.py", line 269, in main
    train()
  File "cifar10_multi_gpu_train.py", line 210, in train
    variables_averages_op = variable_averages.apply(tf.trainable_variables())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 373, in apply
    colocate_with_primary=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 110, in create_slot
    return _create_slot_var(primary, val, "")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 64, in _create_slot_var
    use_resource=_is_resource(primary))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1034, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 933, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 356, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
    use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 671, in _get_single_variable
    "VarScope?" % name)
ValueError: Variable conv1/weights/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

Answer 1

You can find the answer to this problem here: Issue 6220

You need to put:
with tf.variable_scope(tf.get_variable_scope()):
in front of the loop that runs over your devices.

So, do this:

with tf.variable_scope(tf.get_variable_scope()):
    for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            # ... existing per-GPU tower code from the script stays here, unchanged ...

The explanation is given in the linked issue.
Quoting it here:

When you do tf.get_variable_scope().reuse_variables() you set the current scope to reuse variables. If you call the optimizer in such scope, it's trying to reuse slot variables, which it cannot find, so it throws an error. If you put a scope around, the tf.get_variable_scope().reuse_variables() only affects that scope, so when you exit it, you're back in the non-reusing mode, the one you want.

Hope that helps, let me know if I should clarify more.
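
To make the quoted explanation concrete, here is a minimal pure-Python sketch of the scoping behaviour it describes. It does not use TensorFlow, and it is not how TensorFlow implements variable scopes: ToyVariableStore, build_towers and the wrap_in_scope flag are hypothetical names invented for this illustration. The only thing it models is that reuse_variables() flips a flag on the innermost scope, and that leaving a scope restores the outer flag.

```python
from contextlib import contextmanager


class ToyVariableStore(object):
    """Toy stand-in for a variable store (hypothetical, not TensorFlow's API)."""

    def __init__(self):
        self.variables = {}
        self.reuse_stack = [False]  # the root scope starts in non-reusing mode

    @contextmanager
    def variable_scope(self):
        # A new scope inherits the current reuse flag; exiting restores the outer one.
        self.reuse_stack.append(self.reuse_stack[-1])
        try:
            yield
        finally:
            self.reuse_stack.pop()

    def reuse_variables(self):
        # Only the innermost scope is switched to reusing mode.
        self.reuse_stack[-1] = True

    def get_variable(self, name):
        if self.reuse_stack[-1]:
            # Reusing mode: the variable must already exist.
            if name not in self.variables:
                raise ValueError("Variable %s does not exist" % name)
            return self.variables[name]
        # Non-reusing mode: the variable must be new.
        if name in self.variables:
            raise ValueError("Variable %s already exists" % name)
        self.variables[name] = object()
        return self.variables[name]


def build_towers(store, num_gpus, wrap_in_scope):
    """Build one tower per GPU, then create a slot-style variable afterwards."""

    def towers():
        for _ in range(num_gpus):
            store.get_variable("conv1/weights")
            store.reuse_variables()  # later towers share the first tower's weights

    if wrap_in_scope:
        with store.variable_scope():
            towers()
    else:
        towers()
    # Created after the loop, so it needs the non-reusing mode to be back.
    return store.get_variable("conv1/weights/ExponentialMovingAverage")
```

Calling build_towers(ToyVariableStore(), 2, wrap_in_scope=False) raises a ValueError for the moving-average variable, because the reuse flag was set on the root scope and never reset. With wrap_in_scope=True the flag is confined to the inner scope, so the variable is created normally after the loop.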
