End of Sequence Error when using tf.estimator and tf.data
I am using tf.estimator.train_and_evaluate and tf.data.Dataset to feed data to the estimator:
Input data function:
import numpy as np
import tensorflow as tf

def data_fn(data_dict, batch_size, mode, num_epochs=10):
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = tf.data.Dataset.from_tensor_slices(data_dict['train_data'].astype(np.float32))
        dataset = dataset.cache()
        dataset = dataset.shuffle(buffer_size=batch_size * 10).repeat(num_epochs).batch(batch_size)
    else:
        dataset = tf.data.Dataset.from_tensor_slices(data_dict['valid_data'].astype(np.float32))
        dataset = dataset.cache()
        dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    return next_element
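(Aside: with recent versions of tf.estimator, an input_fn can also return the tf.data.Dataset itself and let the Estimator create and drive the iterator, instead of calling make_one_shot_iterator manually. A minimal sketch of that variant, under the same assumptions as the code above:)

def data_fn(data_dict, batch_size, mode, num_epochs=10):
    key = 'train_data' if mode == tf.estimator.ModeKeys.TRAIN else 'valid_data'
    dataset = tf.data.Dataset.from_tensor_slices(data_dict[key].astype(np.float32)).cache()
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=batch_size * 10).repeat(num_epochs)
    return dataset.batch(batch_size)  # the Estimator builds the iterator itself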
Training function:
def train_model(data):
    tf.logging.set_verbosity(tf.logging.INFO)
    config = tf.ConfigProto(allow_soft_placement=True,
                            log_device_placement=False)
    config.gpu_options.allow_growth = True
    run_config = tf.contrib.learn.RunConfig(
        save_checkpoints_steps=10,
        keep_checkpoint_max=10,
        session_config=config
    )
    train_input = lambda: data_fn(data, 100, tf.estimator.ModeKeys.TRAIN, num_epochs=1)
    eval_input = lambda: data_fn(data, 1000, tf.estimator.ModeKeys.EVAL)
    # model_fn and hps (hyperparameters) are defined elsewhere.
    estimator = tf.estimator.Estimator(model_fn=model_fn, params=hps, config=run_config)
    train_spec = tf.estimator.TrainSpec(train_input, max_steps=100)
    eval_spec = tf.estimator.EvalSpec(eval_input,
                                      steps=None,
                                      throttle_secs=30)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
Training goes fine, but at evaluation time I get this error:
OutOfRangeError (see above for traceback): End of sequence
If I don't use Dataset.batch on the evaluation dataset (by omitting the dataset = dataset.batch(batch_size) line in data_fn), I get the same error, only after a much longer time.
The only way I could avoid this error was to not batch the data and use steps=1 for evaluation, but is that performing the evaluation on the whole dataset? I don't understand what causes this error, since the documentation suggests I should be able to evaluate on batches as well.
Note: I get the same error when using tf.estimator.evaluate on the batched data.
I posted this as a GitHub issue, and here is the response from the TensorFlow team:
https://github.com/tensorflow/tensorflow/issues/19541
Copied from "xiejw" for completeness:
If I understand correctly, the issue is: "once you give the estimator an input_fn with a dataset inside, the evaluate process will error out with OutOfRangeError."
Estimator can actually handle this correctly. However, a known common root cause is that the metrics defined in model_fn have a bug. We need to rule that part out first.
@mrezak if possible, can you show the code of the model_fn? Or if you have a minimal reproducible script, that would be extremely helpful. -- Thanks in advance.
A common problem here is that a metric in TensorFlow should return two ops: update_op and value_op. Estimator calls the update_op for each batch from the input source and, once it is exhausted, calls the value_op to get the metric values. The value_op should depend only on variable reads.
Many model_fns tie the value_op to the input pipeline, so estimator.evaluate will trigger the input pipeline one more time, which errors out with OutOfRangeError.
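In other words, a minimal sketch of the failure mode and the fix (recon_loss here is a placeholder for any per-batch loss tensor computed from the input, not a name from the original post):

# Problematic: using the raw loss tensor as the metric's value_op ties it to
# the input pipeline, so fetching it after the dataset is exhausted re-runs
# the pipeline and raises OutOfRangeError:
#   eval_metric_ops = {'recon_loss': (recon_loss, tf.no_op())}

# Correct: tf.metrics.mean accumulates recon_loss into local variables via
# update_op; its value_op only reads those variables, never the input pipeline.
value_op, update_op = tf.metrics.mean(recon_loss)
eval_metric_ops = {'recon_loss': (value_op, update_op)}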
The problem was indeed how I defined the eval_metric in my model_fn. In my actual code, the total loss I optimize is composed of multiple losses (reconstruction + L2 + KL), and in the evaluation part I wanted to get the reconstruction loss (on the validation data), which depends on the input data pipeline. My actual reconstruction cost is more complex than MSE (and matches none of the other tf.metrics functions), so it was not straightforward to implement with the basic tf.metrics functions.
Here is xiejw's suggestion that solved the problem:

my_total_loss = ...  # the loss you care about. Pay attention to how you reduce the loss.
eval_metric_ops = {'total_loss': tf.metrics.mean(my_total_loss)}
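For completeness, a minimal self-contained model_fn sketch of how the fix slots in (the toy autoencoder, layer sizes, and optimizer are illustrative assumptions, not the code from the original post):

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Toy autoencoder standing in for the real model.
    hidden = tf.layers.dense(features, 32, activation=tf.nn.relu)
    reconstruction = tf.layers.dense(hidden, features.shape[-1].value)

    # Training loss composed of several parts, mirroring the post's setup.
    recon_loss = tf.reduce_mean(tf.squared_difference(reconstruction, features))
    l2_loss = 1e-4 * tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    total_loss = recon_loss + l2_loss

    # The fix: tf.metrics.mean returns (value_op, update_op). The Estimator
    # runs update_op once per batch and fetches value_op only at the end, and
    # value_op reads the metric's accumulator variables, not the input pipeline.
    eval_metric_ops = {'reconstruction_loss': tf.metrics.mean(recon_loss)}

    train_op = None
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer(1e-3).minimize(
            total_loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(
        mode=mode, loss=total_loss, train_op=train_op,
        eval_metric_ops=eval_metric_ops)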