Error when storing checkpoints to Google Cloud bucket
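I am running a TensorFlow model on Google Cloud using ML Engine, and the checkpoint saver fails to save files to the bucket. I am using TensorFlow 1.4 and a tf.Estimator with the tf.estimator.train_and_evaluate method.
These are the log records, where gs://e-trial-central1/models/1530351907.8359423 is the model_dir argument given to the estimator: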
E master-replica-0 Couldn't match files for checkpoint gs://e-trial-central1/models/1530351907.8359423/.
I master-replica-0 Create CheckpointSaverHook.
I master-replica-0 Restoring parameters from gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
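Things I have already tried, based on suggestions from other posts:
- Saving to a regional bucket (us-central1) instead of a multi-regional one. This gives the same error.
- Using a simpler path with no "." in the folder name. This gives the same error.
- Saving to a local path instead of the bucket. This works! But I ultimately want the files in the bucket.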
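Compared with the other posts, the odd thing here is that the checkpoint path itself looks corrupted: there is a '.' after the model directory instead of the TensorFlow checkpoint pattern (model.ckpt).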
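The mismatch between the logged path and the expected checkpoint pattern can be checked mechanically. This is only a minimal sketch: it assumes Estimator checkpoints are named model.ckpt-<global_step> (the step suffix is not spelled out in the post), and is_checkpoint_prefix is a hypothetical helper, not a TensorFlow function.

```python
import re

# Assumed naming convention: "model.ckpt-<global_step>".
_PREFIX = re.compile(r"model\.ckpt-\d+")


def is_checkpoint_prefix(path):
    """Return True if the last path component looks like a checkpoint prefix."""
    last_component = path.rsplit("/", 1)[-1]
    return _PREFIX.fullmatch(last_component) is not None


model_dir = "gs://e-trial-central1/models/1530351907.8359423"

# The path as it appears in the log: the model directory followed by "/.".
print(is_checkpoint_prefix(model_dir + "/."))

# A path of the shape one would expect (the step number is made up).
print(is_checkpoint_prefix(model_dir + "/model.ckpt-1000"))
```

The first call reports that the logged path is not a checkpoint prefix, which matches the observation that the saver is being pointed at the directory rather than at a checkpoint inside it.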
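Also, when I look at the model directory in the bucket after the failure, there are actually files there: the TF events file and the .index, .meta and .data... files, but the checkpoint file is missing.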
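To make this observation easier to reproduce, here is a rough sketch that summarizes a directory listing: it reports whether the plain "checkpoint" state file is present and which parts exist for each checkpoint prefix. The file names in the listing are hypothetical (the post elides the exact names), summarize_model_dir is an illustrative helper, and the model.ckpt-<step> naming is an assumption.

```python
import re
from collections import defaultdict

# Matches files belonging to one checkpoint, e.g. "model.ckpt-1000.index".
_CKPT_FILE = re.compile(r"^(model\.ckpt-\d+)\.(index|meta|data-\d+-of-\d+)$")


def summarize_model_dir(filenames):
    """Summarize the checkpoint-related files in a model_dir listing.

    Returns (has_state_file, parts): has_state_file says whether the
    "checkpoint" state file is present, and parts maps each checkpoint
    prefix to the sorted kinds of files found for it.
    """
    has_state_file = "checkpoint" in filenames
    parts = defaultdict(set)
    for name in filenames:
        match = _CKPT_FILE.match(name)
        if match:
            prefix, suffix = match.groups()
            # Collapse shard names like "data-00000-of-00001" to "data".
            parts[prefix].add(suffix.split("-")[0])
    return has_state_file, {p: sorted(kinds) for p, kinds in parts.items()}


# Hypothetical listing resembling what the question describes.
listing = [
    "events.out.tfevents.1530351907.master-replica-0",
    "model.ckpt-0.index",
    "model.ckpt-0.meta",
    "model.ckpt-0.data-00000-of-00001",
]
has_state_file, parts = summarize_model_dir(listing)
print("checkpoint state file present:", has_state_file)
print("checkpoint parts:", parts)
```

With a listing like this the weights and graph files are all there, but the state file that records the latest checkpoint is not, so a lookup of the latest checkpoint has nothing to return.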
Any ideas what could cause this, or what to try next?
Any help is much appreciated!
This was resolved by moving to a newer version of TensorFlow (1.8).
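Since the fix depends on the TensorFlow version, a training script could fail fast when it starts on an older release. This is a small sketch under assumptions: tf_version_at_least is a hypothetical helper, the 1.8 threshold comes only from this answer, and the version string is assumed to start with "major.minor".

```python
import re


def tf_version_at_least(version, minimum=(1, 8)):
    """Return True if a version string such as "1.8.0" is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    found = tuple(int(part) for part in match.groups())
    return found >= tuple(minimum)


# In a real training script the argument would be tf.__version__.
for candidate in ("1.4.0", "1.8.0"):
    print(candidate, tf_version_at_least(candidate))
```

Comparing integer tuples rather than raw strings keeps a release such as 1.10 from being ordered before 1.8.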