成功完成 1000 后,Cloud ML 上的作业失败
Job failed on Cloud ML after successful completion of 1000
我已经完成了关于人口普查数据的 cloudML 教程:cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction,其中作业成功。但是,当我浏览有关花卉图像数据的本教程时:https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow my training task appears to successful based on the completion of 1000 steps from the log. However, upon completion from this snapshot StackDriver logs,它说作业失败。我尝试使用相同的结构替换人口普查数据演练中的命令行参数,删除并重新创建 JOB_ID 和 --output_path 用户参数,使用 STANDARD_1 比例层但无济于事。我能从社区获得的任何帮助都将不胜感激。谢谢!
下面是错误,您可以看到它在日志快照的尾端弹出:
{
textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
run(model, argv)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
dispatch(args, model, cluster, task)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
Trainer(args, model, cluster, task).run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
self.eval(session)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
self.model.format_metric_values(self.evaluator.evaluate()))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
return metric_values
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
sess.run(enqueue_op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
[[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
Caused by op u'ReaderReadUpToV2', defined at:
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
run(model, argv)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
dispatch(args, model, cluster, task)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
Trainer(args, model, cluster, task).run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
self.eval(session)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
self.model.format_metric_values(self.evaluator.evaluate()))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
self.eval_batch_size)
File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
num_epochs=None if is_training else 2)
File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
filename_queue, batch_size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
num_records=num_records, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
[[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"***
该错误表明尝试读取时未找到 404
gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
那个文件存在吗?
根据名字我猜是评价数据。所以我猜你每 1000 步只进行 运行ning 评估,这就是为什么 1000 步成功完成的原因。然后它尝试 运行 评估但失败,因为数据不存在。
我已经完成了关于人口普查数据的 cloudML 教程:cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction,其中作业成功。但是,当我浏览有关花卉图像数据的本教程时:https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow my training task appears to successful based on the completion of 1000 steps from the log. However, upon completion from this snapshot StackDriver logs,它说作业失败。我尝试使用相同的结构替换人口普查数据演练中的命令行参数,删除并重新创建 JOB_ID 和 --output_path 用户参数,使用 STANDARD_1 比例层但无济于事。我能从社区获得的任何帮助都将不胜感激。谢谢!
下面是错误,您可以看到它在日志快照的尾端弹出:
{
textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
run(model, argv)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
dispatch(args, model, cluster, task)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
Trainer(args, model, cluster, task).run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
self.eval(session)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
self.model.format_metric_values(self.evaluator.evaluate()))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
return metric_values
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
sess.run(enqueue_op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
[[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
Caused by op u'ReaderReadUpToV2', defined at:
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
run(model, argv)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
dispatch(args, model, cluster, task)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
Trainer(args, model, cluster, task).run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
self.eval(session)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
self.model.format_metric_values(self.evaluator.evaluate()))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
self.eval_batch_size)
File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
num_epochs=None if is_training else 2)
File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
filename_queue, batch_size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
num_records=num_records, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
[[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"***
该错误表明尝试读取时未找到 404
gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
那个文件存在吗?
根据名字我猜是评价数据。所以我猜你每 1000 步只进行 运行ning 评估,这就是为什么 1000 步成功完成的原因。然后它尝试 运行 评估但失败,因为数据不存在。