Why do I get an out-of-memory exception when training a model on Google Cloud ML?

I am following the tutorial to train a TensorFlow 1.3 object detection model. I want to retrain the faster_rcnn_resnet101_coco or faster_rcnn_inception_resnet_v2_atrous_coco model with my small data set (1 class, ~100 examples) on Google Cloud. I have changed the number of classes and the PATH_TO_BE_CONFIGURED entries in the relevant config files, as the tutorial suggests.

Dataset: 12 images, 4032 × 3024, each with 10-20 labeled bounding boxes.

Why do I get an out-of-memory exception?

The replica master 0 ran out-of-memory and exited with a non-zero status of 247.

Note that I have tried several configurations:

  1. Scale tier BASIC_GPU
  2. The default config yaml
  3. A custom yaml that uses instances with more memory

    trainingInput:
      runtimeVersion: "1.0"
      scaleTier: CUSTOM
      masterType: complex_model_l
      workerCount: 7
      workerType: complex_model_s
      parameterServerCount: 3
      parameterServerType: standard
    

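Can you describe your dataset? In my experience, when users run into OOM problems it is usually because the images in their dataset are high resolution. Pre-scaling the images to a smaller size helps with the memory problem.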
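To make the pre-scaling idea concrete, here is a minimal sketch of the arithmetic involved. The helper names (`fit_within`, `scale_box`, `float32_bytes`) and the 1024-pixel limit are my own assumptions for illustration, not part of the Object Detection API, and the memory figure assumes the decoded image is held as a 3-channel float32 tensor. The actual resizing would be done with an image library of your choice before you build the training records.

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) so the longer side is at most max_side.

    The aspect ratio is preserved; images already small enough are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    return round(width * max_side / longest), round(height * max_side / longest)


def scale_box(box, scale):
    """Scale a pixel-coordinate box (xmin, ymin, xmax, ymax) by `scale`.

    Only needed for annotations stored in pixels; boxes stored as fractions
    of the image size (normalized coordinates) do not change when resizing.
    """
    return tuple(round(v * scale) for v in box)


def float32_bytes(width, height, channels=3):
    """Bytes needed for one decoded image stored as float32 (assumption)."""
    return width * height * channels * 4


if __name__ == "__main__":
    old_w, old_h = 4032, 3024          # image size from the question
    new_w, new_h = fit_within(old_w, old_h, max_side=1024)  # 1024 is arbitrary
    scale = new_w / old_w
    print("resized to:", new_w, "x", new_h)
    print("bytes before:", float32_bytes(old_w, old_h))
    print("bytes after: ", float32_bytes(new_w, new_h))
    # Hypothetical pixel box, just to show the coordinate update.
    print("scaled box:", scale_box((1000, 500, 2000, 1500), scale))
```

For the 4032 × 3024 images in the question this yields 1024 × 768, and the per-image float32 footprint drops from about 146 MB to about 9 MB, before counting any intermediate feature maps the model allocates.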

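If you are working with a large dataset, I strongly recommend using "large_model" in your config file (config.yaml), and you should also set runtimeVersion to "1.4". You chose "1.0", which makes ML Engine select TensorFlow version 1.0. For more information on this, see Runtime Version, which says: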

"You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version."

Therefore, I recommend the following configuration:

    trainingInput:
      runtimeVersion: "1.4"
      scaleTier: CUSTOM
      masterType: large_model
      workerCount: 7
      workerType: complex_model_l
      parameterServerCount: 3
      parameterServerType: standard

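In the configuration above,

    masterType: large_model

selects a machine with a lot of memory. This machine type is especially suited to parameter servers when the model is very large (many hidden layers, or layers with a very large number of nodes). Hope this helps.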