Tensorflow Training using input queue gets stuck
I am trying to set up neural network training similar to the one in this tutorial.
My code looks like this:

def train():
    init_op = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(init_op)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    step = 0
    try:
        while not coord.should_stop():
            step += 1
            print 'Training step %i' % step
            training = train_op()
            sess.run(training)
    except tf.errors.OutOfRangeError:
        print 'Done training - epoch limit reached.'
    finally:
        coord.request_stop()
        coord.join(threads)
        sess.close()
and

MIN_NUM_EXAMPLES_IN_QUEUE = 10
NUM_PRODUCING_THREADS = 1
NUM_CONSUMING_THREADS = 1

def train_op():
    images, true_labels = inputs()
    predictions = NET(images)
    true_labels = tf.cast(true_labels, tf.float32)
    loss = tf.nn.softmax_cross_entropy_with_logits(predictions, true_labels)
    return OPTIMIZER.minimize(loss)

def inputs():
    filenames = [os.path.join(FLAGS.train_dir, filename)
                 for filename in os.listdir(FLAGS.train_dir)
                 if os.path.isfile(os.path.join(FLAGS.train_dir, filename))]
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=FLAGS.training_epochs, shuffle=True)
    example_list = [_read_and_preprocess_image(filename_queue)
                    for _ in xrange(NUM_CONSUMING_THREADS)]
    image_batch, label_batch = tf.train.shuffle_batch_join(
        example_list,
        batch_size=FLAGS.batch_size,
        capacity=MIN_NUM_EXAMPLES_IN_QUEUE + (NUM_CONSUMING_THREADS + 2) * FLAGS.batch_size,
        min_after_dequeue=MIN_NUM_EXAMPLES_IN_QUEUE)
    return image_batch, label_batch
The tutorial says:

    These require that you call tf.train.start_queue_runners before running any training or inference steps, or it will hang forever.

I am calling tf.train.start_queue_runners, but the execution of train() still gets stuck at the first occurrence of sess.run(training).

Does anyone know what I am doing wrong?
You are redefining the network on every pass through the training loop.

Remember that TensorFlow first defines an execution graph and then executes it. You want to call your train_op() outside of the run loop: the graph needs to be defined before you call initialize_all_variables and tf.train.start_queue_runners, as sketched below.
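A minimal sketch of the fixed train(), assuming the train_op() and inputs() definitions from the question stay as they are; only the order of graph construction and session setup changes:

def train():
    # Build the graph exactly once, before creating the session and the queue runners.
    training = train_op()

    init_op = tf.initialize_all_variables()
    # Note: because string_input_producer is given num_epochs, its epoch counter
    # is a local variable, so running tf.initialize_local_variables() may also be needed.
    sess = tf.Session()
    sess.run(init_op)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    step = 0
    try:
        while not coord.should_stop():
            step += 1
            print 'Training step %i' % step
            # Only execute the already-defined op; do not rebuild the graph here.
            sess.run(training)
    except tf.errors.OutOfRangeError:
        print 'Done training - epoch limit reached.'
    finally:
        coord.request_stop()
        coord.join(threads)
        sess.close()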