Google 云对象检测模型训练错误

Question

我在 google 中训练计算机视觉模型时遇到问题，我确信问题与 GPU 有关。我知道 google 说默认你有 1 个 GPU 训练失败并显示此消息错误： "请求8个K80加速器超过允许最大值0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0TPU_V2, 0TPU_V2_POD, 0TPU_V3, 0 TPU_V3_POD，0 个 V100 加速器。

你可以看到我有 0 个来自所有加速器

这是我正在尝试的完整命令运行 :

gcloud ai-platform jobs submit training segmentation_maskrcnn_test_0 ^
--runtime-version 2.1 ^
--python-version 3.7 ^
--job-dir=gs://image-segmentation-b/training-process ^
--package-path ./object_detection ^
--module-name object_detection.model_main_tf2 ^
--region us-central1 ^
--scale-tier CUSTOM ^
--master-machine-type n1-highcpu-32 ^
--master-accelerator count=8,type=nvidia-tesla-k80 ^
-- ^
--model_dir=gs://image-segmentation-b/training-process ^
--pipeline_config_path=gs:gs://image-segmentation-b/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8 - cloud.config

这是完整的错误：

ERROR: (gcloud.ai-platform.jobs.submit.training) HttpError accessing <https://ml.googleapis.com/v1/projects/project id/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Tue, 18 Jan 2022 11:12:39 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'transfer-encoding': 'chunked', 'status': 429}>, content <{
  "error": {
    "code": 429,
    "message": "Quota failure for project project id. The request for 8 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.QuotaFailure",
        "violations": [
          {
            "subject": "project id",
            "description": "The request for 8 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators."
          }
        ]
      }
    ]
  }
}
>
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.

我该如何解决这个错误？我必须去某个地方为项目启用 GPU 吗？

Answer 1

您需要提高 GPU 配额才能训练您的模型。

您的项目或您的帐户没有足够的 GPU 配额来满足您的请求。

您可以在此处查看您的配额：API Quotas

Google 云对象检测模型训练错误

Google cloud object detection model training error

python

google-cloud-platform

gcloud

tensorflow

google-cloud-ai-platform-pipelines