Job failed on Cloud ML after successful completion of 1000 steps

I completed the Cloud ML tutorial on census data (cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction), and that job succeeded. However, when I go through this tutorial on flower image data: https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow my training task appears to be successful, judging by the completion of 1000 steps in the log. Once it finishes, though, the StackDriver logs say the job failed. I tried reusing the command-line argument structure from the census data walkthrough, deleting and recreating the JOB_ID and --output_path user arguments, and using the STANDARD_1 scale tier, all to no avail. Any help from the community would be greatly appreciated. Thanks!

Below is the error, which appears at the tail end of the log snapshot:

{
 textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
    return metric_values
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
    sess.run(enqueue_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
Caused by op u'ReaderReadUpToV2', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
    self.eval_batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
    return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
    filename_queue, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
    num_records=num_records, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"***

The error indicates a 404 (Not Found) when trying to read

gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval

Does that file exist?

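Based on the name, I'm guessing it is the evaluation data. My guess is that you only run evaluation every 1000 steps, which is why the first 1000 steps completed successfully. The job then tried to run evaluation and failed because the data doesn't exist.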
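One way to catch this before submitting the job is to check that every input pattern matches at least one file. The sketch below is only an illustration: `find_missing_inputs` and the directory layout are hypothetical names, not part of the tutorial code. The default `glob.glob` only understands local paths, so for `gs://` paths you would need to pass a GCS-aware glob function instead (I believe `tf.gfile.Glob` does this in TensorFlow 1.x, but treat that as an assumption). It is also worth noting that the path in the error contains the timestamp 20170522_121407 while the failing job id is flowers_User_20170524_145125, so the eval path may be pointing at the output of a different run.

```python
import glob
import os
import tempfile


def find_missing_inputs(patterns, glob_fn=glob.glob):
    """Return the patterns that match no files.

    glob_fn is any callable that maps a pattern to a list of paths.
    The default handles local paths only; pass a GCS-aware glob
    function to check gs:// patterns (assumption, see note above).
    """
    return [p for p in patterns if not glob_fn(p)]


if __name__ == "__main__":
    # Local stand-in for a preproc/ output directory (hypothetical layout):
    # training shards exist, evaluation shards were never written.
    root = tempfile.mkdtemp()
    open(os.path.join(root, "train-00000-of-00001"), "w").close()

    missing = find_missing_inputs([
        os.path.join(root, "train*"),
        os.path.join(root, "eval*"),
    ])
    print("patterns with no matching files:", missing)
```

If the eval pattern shows up as missing, either the preprocessing step for the evaluation set did not finish, or the eval path passed to the training job points at the wrong directory.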