Google Cloud ML: repeating "Attempting refresh to obtain initial access_token", then "Job failed"

Question

I am trying to run training on Google Cloud ML Engine. I submit the job with the following command:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_tpu_main \
--runtime-version 1.13 \
--scale-tier BASIC_TPU \
--region us-central1 \
-- \
--model_dir=gs://${YOUR_GCS_BUCKET}/train \
--tpu_zone us-central1 \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/pipeline.config

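However, after the job is created and all required packages are installed, I start receiving these messages repeatedly:

until the job fails with this output:

I have already tried this, this and this, without success.

I suspect the problem is related to authentication, so I followed this tutorial, but it did not help.

Any help is much appreciated!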

Answer 1

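There seems to be a problem with TPU allocation. I worked around it by switching from TPU to GPU. The submission command changes accordingly: --module-name becomes object_detection.model_main, --scale-tier becomes BASIC_GPU, and the --tpu_zone argument is dropped: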

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--runtime-version 1.13 \
--scale-tier BASIC_GPU \
--region us-central1 \
-- \
--model_dir=gs://${YOUR_GCS_BUCKET}/train \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/pipeline.config
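
When comparing the TPU and GPU runs, it may help to build the job name once and reuse it to look at the job's state and logs after submission. The following is only a sketch: the helper functions make_job_name and inspect_job are hypothetical names introduced here for illustration, and it assumes the same gcloud ml-engine command group used above is installed and authenticated.

```shell
#!/usr/bin/env bash
# Sketch only: make_job_name and inspect_job are hypothetical helpers,
# not part of gcloud or of the Object Detection API.

# Build a job name in the same pattern used by the commands above.
# $1 = user name, $2 = Unix timestamp
make_job_name() {
  printf '%s_object_detection_%s' "$1" "$2"
}

# Print the job's current state, then stream its logs.
# $1 = job name; requires an authenticated gcloud installation.
inspect_job() {
  gcloud ml-engine jobs describe "$1"
  gcloud ml-engine jobs stream-logs "$1"
}

# Compute the name once so the submit and inspect steps agree on it.
JOB_NAME="$(make_job_name "$(whoami)" "$(date +%s)")"
echo "Job name: ${JOB_NAME}"

# After submitting with: gcloud ml-engine jobs submit training "${JOB_NAME}" ...
# the job can be inspected with: inspect_job "${JOB_NAME}"
```

Keeping the name in JOB_NAME avoids evaluating `date +%s` a second time, which would otherwise produce a different name than the one that was submitted.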

Update

I contacted @Yash Sonthalia, as he requested, and the problem was resolved shortly afterwards. Thanks!

