Why do I get an out-of-memory exception when training a model on Google Cloud ML?

I am following the tutorial to train an object detection TensorFlow 1.3 model. I want to retrain the faster_rcnn_resnet101_coco or faster_rcnn_inception_resnet_v2_atrous_coco model on my small data set (1 class, ~100 examples) on Google Cloud. I changed the number of classes and every PATH_TO_BE_CONFIGURED in the corresponding config files, as the tutorial suggests.

Dataset: 12 images, 4032 × 3024, each with 10-20 labeled bounding boxes.


The replica master 0 ran out-of-memory and exited with a non-zero status of 247.


  1. scale tier BASIC_GPU
  2. the default config yaml
  3. a custom yaml that uses instances with more memory

      runtimeVersion: "1.0"
      scaleTier: CUSTOM
      masterType: complex_model_l
      workerCount: 7
      workerType: complex_model_s
      parameterServerCount: 3
      parameterServerType: standard

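Can you describe your dataset? In my experience, when users run into OOM issues it is usually because the images in their dataset are high resolution. Pre-scaling the images to a smaller size helps with the memory issues.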
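To give a rough idea of what pre-scaling buys you, here is a minimal sketch, not a definitive implementation. It assumes Pillow is installed, that your boxes are stored in pixel coordinates as (xmin, ymin, xmax, ymax), and that 1024 px on the longest side is an acceptable target; all three are my assumptions, and the helper names are made up for illustration. A single 4032 × 3024 RGB image decoded to float32 takes about 146 MB before any feature maps are computed, versus about 9.4 MB at 1024 × 768.

```python
def scaled_size(width, height, max_dim=1024):
    """Return (new_width, new_height, factor) so the longest side is <= max_dim.

    Images that are already small enough are left unchanged (factor 1.0).
    """
    longest = max(width, height)
    if longest <= max_dim:
        return width, height, 1.0
    factor = max_dim / longest
    return round(width * factor), round(height * factor), factor


def scale_box(box, factor):
    """Scale a pixel-coordinate box (xmin, ymin, xmax, ymax) by the same factor."""
    return tuple(coord * factor for coord in box)


def float32_bytes(width, height, channels=3):
    """Size in bytes of one decoded image held as float32 (4 bytes per value)."""
    return width * height * channels * 4


def resize_image_file(src_path, dst_path, max_dim=1024):
    """Write a downscaled copy of src_path to dst_path and return the factor.

    Pillow is imported here so the helpers above work without it.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        new_w, new_h, factor = scaled_size(img.width, img.height, max_dim)
        img.resize((new_w, new_h), Image.LANCZOS).save(dst_path)
    return factor
```

You would run `resize_image_file` over the images before building the TFRecords and pass the returned factor to `scale_box` for each annotation. If your labels are already normalized to the 0-1 range, the boxes do not need to change at all.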

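If you are working with a large dataset, I strongly recommend using "large_model" in your configuration file (config.yaml), and you should use a recent runtime version by setting runtimeVersion to "1.4". You chose "1.0", which makes ML Engine select TensorFlow 1.0. For more information, see Runtime Version, which says: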

"You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version."


 runtimeVersion: "1.4"
 scaleTier: CUSTOM
 masterType: large_model
 workerCount: 7
 workerType: complex_model_l
 parameterServerCount: 3
 parameterServerType: standard


