Google Cloud object detection model training error
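I am having trouble training a computer vision model on Google Cloud, and I am sure the problem is related to GPUs. I know Google says you get 1 GPU by default, but training fails with this error message:
"The request for 8 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators."
As you can see, my allowed maximum is 0 for every accelerator type.
This is the full command I am trying to run: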
gcloud ai-platform jobs submit training segmentation_maskrcnn_test_0 ^
--runtime-version 2.1 ^
--python-version 3.7 ^
--job-dir=gs://image-segmentation-b/training-process ^
--package-path ./object_detection ^
--module-name object_detection.model_main_tf2 ^
--region us-central1 ^
--scale-tier CUSTOM ^
--master-machine-type n1-highcpu-32 ^
--master-accelerator count=8,type=nvidia-tesla-k80 ^
-- ^
--model_dir=gs://image-segmentation-b/training-process ^
--pipeline_config_path=gs:gs://image-segmentation-b/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8 - cloud.config
This is the full error:
ERROR: (gcloud.ai-platform.jobs.submit.training) HttpError accessing <https://ml.googleapis.com/v1/projects/project id/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Tue, 18 Jan 2022 11:12:39 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'transfer-encoding': 'chunked', 'status': 429}>, content <{
"error": {
"code": 429,
"message": "Quota failure for project project id. The request for 8 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.",
"status": "RESOURCE_EXHAUSTED",
"details": [
{
"@type": "type.googleapis.com/google.rpc.QuotaFailure",
"violations": [
{
"subject": "project id",
"description": "The request for 8 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators."
}
]
}
]
}
}
>
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
How can I resolve this error? Do I have to go somewhere to enable GPUs for the project?
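You need to increase your GPU quota before you can train your model.
The error is a quota failure (HTTP 429, RESOURCE_EXHAUSTED): your project or your account does not have enough GPU quota to satisfy the request. It asks for 8 K80 accelerators, while the allowed maximum for every accelerator type is currently 0.
You can view your quotas here: API Quotas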
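To see which accelerator limit is blocking the job, the `description` string from the error can be parsed directly. The following is only a minimal sketch: the helper name `parse_quota_message` is hypothetical, and it assumes the message keeps the exact wording shown in the error above, which may change between API versions.

```python
import re

# The "description" field copied from the QuotaFailure violation in the error above.
DESCRIPTION = (
    "The request for 8 K80 accelerators exceeds the allowed maximum of "
    "0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, "
    "0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators."
)


def parse_quota_message(message):
    """Hypothetical helper: return (requested_count, requested_type, allowed).

    `allowed` maps each accelerator type named in the message to its limit.
    Assumes the message wording matches the error shown in the question.
    """
    requested = re.search(r"request for (\d+) (\w+) accelerators", message)
    if requested is None or "allowed maximum of" not in message:
        raise ValueError("not an accelerator quota message")
    # Keep only the list of limits between "allowed maximum of" and "accelerators".
    tail = message.split("allowed maximum of", 1)[1]
    tail = tail.split(" accelerators", 1)[0]
    allowed = {name: int(count) for count, name in re.findall(r"(\d+) (\w+)", tail)}
    return int(requested.group(1)), requested.group(2), allowed


count, kind, allowed = parse_quota_message(DESCRIPTION)
shortfall = count - allowed.get(kind, 0)
print(f"requested {count} x {kind}, allowed {allowed.get(kind, 0)}, short by {shortfall}")
```

For the error in the question, this reports a request for 8 K80 accelerators against a limit of 0, so the job cannot start until the K80 quota (or the quota for another GPU type you switch to) is raised to at least the number requested.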