Keras model not using GPU on AI Platform training

Question

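I have a simple Keras model that I am submitting to Google Cloud AI Platform training, and I would like to use a GPU for processing.

The job submits and completes successfully. Looking at the usage statistics, GPU utilization never goes above 0%. CPU usage, however, increases as training progresses.

Any ideas on what might be preventing my model from using the GPU? Is there a way to troubleshoot this?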

config.yaml

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu

I am using runtime version 1.13, which comes with tensorflow already installed. The additional required packages in my setup.py are:

REQUIRED_PACKAGES = ['google-api-core==1.14.2',
                     'google-cloud-core==1.0.3',
                     'google-cloud-logging==1.12.1',
                     'google-cloud-storage==1.18.0',
                     'gcsfs==0.2.3',
                     'h5py==2.9.0',
                     'joblib==0.13.2',
                     'numpy==1.16.4',
                     'pandas==0.24.2',
                     'protobuf==3.8.0',
                     'scikit-learn==0.21.2',
                     'scipy==1.3.0',
                     'Keras==2.2.4',
                     'Keras-Preprocessing==1.1.0',
                     ]

Looking at the logs, it appears the GPU is found:

master-replica-0 Found device 0 with properties:  master-replica-0 
master-replica-0 name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 master-replica-0
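
One way to double-check this from inside the training code is to list the devices TensorFlow can see when the job starts. The following is only a sketch, assuming Python 3 and the `device_lib` helper that ships with TensorFlow 1.x; the function names `gpu_names` and `log_visible_gpus` are made up for illustration and are not part of the original job.

```python
def gpu_names(devices):
    """Return the names of all devices whose type is GPU."""
    return [d.name for d in devices if d.device_type == "GPU"]


def log_visible_gpus():
    """Print the GPUs TensorFlow can see; meant to be called once at startup."""
    # Imported lazily so gpu_names() can be used without TensorFlow installed.
    from tensorflow.python.client import device_lib

    names = gpu_names(device_lib.list_local_devices())
    print("Visible GPUs:", names if names else "none")
    return names
```

Calling `log_visible_gpus()` at the top of the trainer writes the result to the job logs. An empty list would suggest a configuration problem, while a non-empty list would suggest looking at how the GPU is being fed.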

Answer 1

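The model is using the GPU, but it is underutilized.

In AI Platform, the utilization graphs on the job overview page lag about 5 minutes behind the activity shown in the logs.
As a result, your logs may show epochs being processed while the utilization graphs still show 0% utilization.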

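How I solved it:

I am using the fit_generator function.
I set use_multiprocessing=True, max_queue_size=10, workers=5. I am still tuning these parameters to find the most efficient combination, but I now see GPU utilization of around 30%.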
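
For reference, here is a minimal sketch of how such input-pipeline settings might be passed to `fit_generator` in Keras 2.2.4, assuming Python 3. The `ArrayBatches` class, the `FIT_KWARGS` dict and the `train` function are hypothetical names, not taken from the original job. The values (multiprocessing on, queue size 10, five workers) are the ones quoted in this answer, not tuned recommendations. Feeding batches from a `keras.utils.Sequence` instead of a plain generator is also an assumption; the Keras documentation describes it as the safer choice when multiprocessing is enabled.

```python
import math

try:
    from keras.utils import Sequence  # present in Keras 2.2.4
except ImportError:
    Sequence = object  # fallback so the batching logic can run without Keras


class ArrayBatches(Sequence):
    """Serves (features, labels) in fixed-size batches, indexed by batch number.

    In real use, features and labels would be NumPy arrays; any sliceable
    sequence works for the batching logic itself.
    """

    def __init__(self, features, labels, batch_size=32):
        super().__init__()
        self.features = features
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.features[lo:hi], self.labels[lo:hi]


# Input-pipeline settings quoted in the answer (starting points, not tuned).
FIT_KWARGS = {
    "use_multiprocessing": True,  # prepare batches in worker processes
    "max_queue_size": 10,         # batches buffered ahead of the GPU
    "workers": 5,                 # number of parallel batch producers
}


def train(model, batches, epochs=1):
    """Call fit_generator on an already compiled Keras model."""
    return model.fit_generator(batches, epochs=epochs, **FIT_KWARGS)
```

The idea is that worker processes prepare upcoming batches on the CPU while the GPU is busy with the current one, so the GPU spends less time waiting for data.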

google-cloud-platform

google-cloud-ml