Error when storing checkpoints to Google Cloud bucket

I am running a TensorFlow model on Google Cloud using ML Engine, and the checkpoint saver fails to save files to the storage bucket. I am using TensorFlow 1.4, and the Estimator is run with tf.estimator.train_and_evaluate.

These are the log entries, where gs://e-trial-central1/models/1530351907.8359423 is the model_dir argument passed to the estimator:

E  master-replica-0 Couldn't match files for checkpoint gs://e-trial-central1/models/1530351907.8359423/. 
I  master-replica-0 Create CheckpointSaverHook.  
I  master-replica-0 Restoring parameters from gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 

Following suggestions from other posts, here is what I have already tried:

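  1. Saving to a regional bucket (us-central1) instead of a multi-regional bucket. This gives the same error.
  2. Using a simpler path with no '.' in the folder name. This gives the same error.
  3. Saving to a local path instead of a bucket. This works! But I ultimately want the files in the bucket.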

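Compared with the other posts, the odd thing here is that the checkpoint path itself looks broken: there is a '.' right after the model directory instead of the TensorFlow pattern (model.ckpt). Also, when I look at the model directory in the bucket after the failure, there are in fact files there: the TF events file and the .index, .meta, and .data... files. The checkpoint file, however, is not there.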
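To make that observation easier to reproduce, here is a minimal sketch that classifies the object names found under a model directory. It assumes the default `model.ckpt-<step>` naming and that the small state file is literally named `checkpoint`. The helper `summarize_model_dir` and the sample listing are hypothetical, not taken from my actual bucket.

```python
import re

# Matches the per-checkpoint files, assuming the default "model.ckpt-<step>" prefix.
CKPT_PART = re.compile(r"^(model\.ckpt-\d+)\.(index|meta|data-\d{5}-of-\d{5})$")


def summarize_model_dir(names):
    """Hypothetical helper: return (has_state_file, {prefix: set_of_parts}).

    `names` is a list of object paths, e.g. copied from a bucket listing.
    """
    has_state_file = False
    prefixes = {}
    for name in names:
        base = name.rsplit("/", 1)[-1]
        if base == "checkpoint":
            # The state file that records the latest checkpoint prefix.
            has_state_file = True
            continue
        match = CKPT_PART.match(base)
        if match:
            part = "data" if match.group(2).startswith("data") else match.group(2)
            prefixes.setdefault(match.group(1), set()).add(part)
    return has_state_file, prefixes


# Illustrative listing only (made-up names mirroring the situation described).
listing = [
    "models/run/events.out.tfevents.1530351907.master-replica-0",
    "models/run/model.ckpt-0.index",
    "models/run/model.ckpt-0.meta",
    "models/run/model.ckpt-0.data-00000-of-00001",
]
has_state_file, prefixes = summarize_model_dir(listing)
print("state file present:", has_state_file)
print("checkpoint prefixes:", prefixes)
```

With a listing like the one above, the per-step files are all found but the state file is reported as missing, which matches what I see in the bucket.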

Any ideas what could cause this, or what to try next?

Any help is much appreciated!

This was resolved by moving to a newer version of TensorFlow (1.8).
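Since the fix depended on the TensorFlow version, one option is to fail fast when a job starts on an older release. The following is only a sketch: `require_min_version` and `parse_major_minor` are hypothetical helpers, and the "1.8" threshold is simply the version that worked for me, not a documented minimum.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) as integers from a string such as '1.8.0'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def require_min_version(actual, minimum="1.8"):
    """Raise RuntimeError if `actual` is older than `minimum`; else return it."""
    if parse_major_minor(actual) < parse_major_minor(minimum):
        raise RuntimeError(
            "TensorFlow %s found, but at least %s is expected" % (actual, minimum)
        )
    return actual


# In a trainer this would be called as require_min_version(tf.__version__).
# Plain strings are used here so the sketch runs without TensorFlow installed.
for candidate in ("1.4.0", "1.8.0", "1.10.1"):
    try:
        require_min_version(candidate)
        print(candidate, "ok")
    except RuntimeError as err:
        print(candidate, "rejected:", err)
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "1.10" below "1.8". On ML Engine the TensorFlow version itself is chosen through the runtime version set when the job is submitted.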