OutOfRangeError when scaling up the Tensor2Tensor Transformer TPU tutorial
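I followed the T2T Transformer "Train a language model" example, and it worked for 10 training steps. However, when I scaled up to 250,000 steps, I got an OutOfRange error (below). Is this a parsing problem or something else?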
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
WARNING:tensorflow:
Error occurred during infeed/outfeed. This may be due to a compile error in the main session. Waiting for a short time for the main session to come back.
End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
Caused by op 'input_pipeline_task0/while/IteratorGetNext', defined at:
File "/usr/local/bin/t2t-trainer", line 32, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/local/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 359, in main
execute_schedule(exp)
...
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 729, in enqueue_ops_fn
features, labels = inputs.features_and_labels()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
ERROR:tensorflow:Feed error: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
During handling of the above exception, another exception occurred:
...
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`.
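I assume you followed the instructions in the document.
The relevant error in the output is the OutOfRangeError line that reads "End of sequence". The input pipeline raises this error to signal upstream that there is no more data to process.
You need to make sure the TPU has data to process by checking each of the following:
The TPU can access the training data (for example, in a GCS bucket).
The paths in your command contain no typos.
Most importantly, either your dataset is large enough or you call dataset.repeat(), so that the training data does not run out before the TPU completes the configured number of training steps.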
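The last point can be illustrated with a plain-Python analogy. This is only a sketch of the exhaustion behavior, not TensorFlow code: `take_batches` and the toy `finite` list are hypothetical names, and `itertools.cycle` stands in for what dataset.repeat() does to an input stream.

```python
import itertools


def take_batches(batches, num_steps):
    """Pull num_steps batches from a stream; fail if the stream ends early."""
    it = iter(batches)
    consumed = []
    for step in range(num_steps):
        try:
            consumed.append(next(it))
        except StopIteration:
            # Analogous to the "End of sequence" OutOfRangeError in the log.
            raise RuntimeError("End of sequence at step %d" % step)
    return consumed


# Hypothetical dataset: one pass yields exactly 3 batches.
finite = [[1, 2], [3, 4], [5, 6]]

# Asking for no more steps than one pass provides works.
assert len(take_batches(finite, 3)) == 3

# Asking for more steps than one pass provides fails.
try:
    take_batches(finite, 4)
    ran_out = False
except RuntimeError:
    ran_out = True
assert ran_out

# A repeated stream (the analogue of dataset.repeat()) never runs out.
assert len(take_batches(itertools.cycle(finite), 10)) == 10
```

A short run of 10 steps can fit inside a single pass over the data, while a much longer run cannot, which would match the behavior described in the question if the input was not being repeated.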
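One of the authors of the Tensor2Tensor library here.
Short answer: reduce --eval_steps.
Long answer:
Unfortunately, TPUEstimator, the library we use under the hood to run on TPU, does not catch the OutOfRangeError when you run out of input data. This is not a problem during training, because the input data is infinite (we call repeat on the input tf.data.Dataset). During evaluation, however, you want to make exactly 1 pass over the data, which means you need to set --eval_steps correctly so that you do not exhaust the input data. Hopefully TPUEstimator will soon handle catching the error, so that you do not have to work out how many evaluation steps to run.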
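As a rough guide to picking a safe value, the arithmetic can be sketched as below. This is a minimal sketch, not part of Tensor2Tensor: `max_eval_steps` is a hypothetical helper, the example counts are made up, and it assumes that each evaluation step consumes one full batch and that a trailing partial batch is dropped.

```python
def max_eval_steps(num_eval_examples, batch_size):
    """Number of full batches that one pass over the eval set can supply.

    Assumes one batch per eval step and that a trailing partial batch
    is dropped (an assumption, not confirmed by the thread).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if num_eval_examples < 0:
        raise ValueError("num_eval_examples must be non-negative")
    return num_eval_examples // batch_size


# Hypothetical numbers: 1000 eval examples with a batch size of 64
# give 15 full batches (960 examples); the remaining 40 are unused.
steps = max_eval_steps(1000, 64)
assert steps == 15
```

Any --eval_steps value at or below the result stays within one pass over the evaluation data under these assumptions; a larger value would reach the end of the input and trigger the error above.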