How to run multiple GPU-accelerated training jobs concurrently on AI Platform

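I am running TensorFlow training jobs on AI Platform with the setting "scaleTier": "BASIC_GPU". My understanding is that this setting uses a single Tesla K80 GPU for my job.

Creating a new job while another job is already running seems to put the new job in a queue until the running job finishes. When I check the logs of the new job, I see this message: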

This job is number 1 in the queue and requires 8.000000 CPUs and 1 K80 accelerators. The project is using 8.000000 CPUs out of 450 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 V100, 4 P4, 4 T4, 8 TPU_V2, 8 TPU_V3 allowed across all regions.The project is using 8.000000 CPUs out of 20 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 P4, 1 T4, 1 V100, 8 TPU_V2, 8 TPU_V3 allowed in the region us-central1.

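This AI Platform documentation seems to say that my project should be able to use up to 30 K80 GPUs concurrently.

Why can't I use even 2 at the same time?

Is there something I need to do to raise my limit to the expected 30?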

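It looks like your project admin has set a quota on the number of GPUs you can use (note that the message says your quota in us-central1 is 20 CPUs, 1 K80, and 1 P100), so the new job is waiting for a K80 to become available.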
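To make the reading of the queue message concrete, here is a minimal Python sketch that extracts the accelerator figures from its regional sentence. The sentence is copied verbatim from the log quoted in the question. The helper name `parse_quota` and the regular expression are illustrative assumptions based on this single message; the log format is not a documented interface and may differ or change.

```python
import re

# Regional sentence of the queue message, copied verbatim from the question.
MESSAGE = (
    "The project is using 8.000000 CPUs out of 20 allowed and 1 K80 "
    "accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 P4, "
    "1 T4, 1 V100, 8 TPU_V2, 8 TPU_V3 allowed in the region us-central1."
)


def parse_quota(message):
    """Return (in_use, accelerator_type, allowed_by_type) for one quota sentence.

    Hypothetical helper: the pattern is inferred from the one message above.
    """
    match = re.search(r"and (\d+) (\w+) accelerators out of (.+?) allowed", message)
    if match is None:
        raise ValueError("unrecognized quota message")
    in_use, acc_type, allowed_list = match.groups()
    allowed = {}
    # Each entry looks like "<count> <ACCELERATOR_NAME>".
    for item in allowed_list.split(", "):
        count, name = item.split(" ")
        allowed[name] = int(count)
    return int(in_use), acc_type, allowed


in_use, acc_type, allowed = parse_quota(MESSAGE)
free = allowed[acc_type] - in_use
print(f"{acc_type}: {in_use} in use, {allowed[acc_type]} allowed, {free} free")
```

With one K80 in use and one allowed in us-central1, no K80 is free, which is why the second job stays queued.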

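Two options:

(1) Go to console.cloud.google.com/iam-admin/quotas, find the Compute Engine API and K80s, and choose "Edit quotas", or ask your admin to increase the quota if necessary. Make sure to edit both the all-regions quota and the us-central1 quota. Otherwise, if the admin has given you 1 GPU per region, run the second job in another region such as us-west1.

(2) It looks like you have a P100 available, so use a custom scale tier and specify a P100.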
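As a sketch of option (2), the snippet below builds the scale-tier part of a job's `trainingInput` with a single P100 attached to the master. The field names (`scaleTier`, `masterType`, `masterConfig.acceleratorConfig`) and the `NVIDIA_TESLA_P100` type follow my understanding of the AI Platform Training job specification and should be checked against the current reference. The function name `p100_training_input` is hypothetical, and `n1-standard-8` is an assumption chosen to match the 8 CPUs reported in the queue message.

```python
import json


def p100_training_input(region="us-central1"):
    """Return only the scale-tier fields of trainingInput (hypothetical helper).

    Other required fields of a real job (package URIs, Python module,
    runtime version, and so on) are omitted and must come from your
    existing job spec.
    """
    return {
        "scaleTier": "CUSTOM",
        # Assumed machine type; it has 8 vCPUs like the BASIC_GPU job in the log.
        "masterType": "n1-standard-8",
        "masterConfig": {
            "acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_P100"},
        },
        "region": region,
    }


training_input = p100_training_input()
print(json.dumps({"trainingInput": training_input}, indent=2))
```

Merging these fields into the existing job spec in place of "scaleTier": "BASIC_GPU" should let the second job draw on the P100 quota instead of waiting for the K80.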

For new projects, the default quota is low. You can request more quota through this form.


google-cloud-platform

google-cloud-ml

gcp-ai-platform-training
