Migrating to Tensorflow 2.x from 1.x results in much slower training and ResourceExhaustedErrors on Google AI platform

Question

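Everything was working fine on Tensorflow 1.14. For various reasons I now have to update it, and training (which I run as Google AI Platform jobs) seems to have degraded significantly: my models now hit a ResourceExhaustedError, and even when I reduce the batch size a lot to get around that (which I would rather not do anyway), training is about 5x slower.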

My migration can be summarised as changing my config yaml from:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  runtimeVersion: "1.14"

to

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  runtimeVersion: "2.5"
  pythonVersion: "3.7"

and, obviously, updating all the relevant code to be compatible with TF 2.x as well. I have also tried fiddling with scaleTier and masterType, to no avail.

My models are Keras-based, involve LSTMs, and have between roughly 2 million and 5.5 million parameters.

What can I do here? Why is there such an extreme degradation in training on Google AI Platform when I make this change?

Answer 1

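It looks like the problem was that I was using recurrent_dropout in my LSTM layers, which in Tensorflow 2.x no longer seems to be supported for accelerated GPU training (the layer falls back to a slower non-cuDNN implementation). Once I removed that argument from my LSTM layers, the problem went away.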
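To make that concrete: the TF 2.x Keras documentation for the LSTM layer lists the argument values required for the fused cuDNN kernel, and recurrent_dropout == 0 is one of them. The snippet below is only a rough sketch of a check against those argument values. `cudnn_blockers` is a hypothetical helper name, not a Tensorflow API. It works on a plain dict such as the one returned by a layer's `get_config()`, and it ignores the documented runtime conditions (right-padded masks, eager execution in the outermost context).

```python
# Argument values under which tf.keras.layers.LSTM can use the cuDNN kernel,
# as listed in the TF 2.x Keras documentation for the layer.
CUDNN_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(layer_config):
    """Return the argument names that would prevent the cuDNN fast path.

    `layer_config` is a plain dict, e.g. the result of `lstm_layer.get_config()`.
    Keys missing from the dict are assumed to be left at their Keras defaults,
    which all satisfy the requirements above.
    """
    blockers = []
    for name, required in CUDNN_REQUIREMENTS.items():
        if name in layer_config and layer_config[name] != required:
            blockers.append(name)
    return blockers


# Illustrative configs (made-up values, not the ones from my actual model).
slow_config = {"units": 128, "dropout": 0.2, "recurrent_dropout": 0.2}
fast_config = {"units": 128, "dropout": 0.2, "recurrent_dropout": 0.0}

print(cudnn_blockers(slow_config))  # ['recurrent_dropout']
print(cudnn_blockers(fast_config))  # []
```

Note that the ordinary dropout argument is not on that list, so input dropout can stay in place; only recurrent_dropout has to go to get the fast kernel back.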

It is worth noting that neither the migration instructions nor the tf_upgrade_v2 script was any help with this.
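Since nothing flags the argument automatically, one option is to search the code for it yourself. The following is a minimal sketch using only the standard-library `ast` module. `find_recurrent_dropout` is a hypothetical helper, not part of any Tensorflow tooling, and it assumes Python 3.8 or newer for `ast.Constant`. It reports every call that passes a nonzero literal recurrent_dropout, and marks non-literal values with None because they cannot be judged statically.

```python
import ast


def find_recurrent_dropout(source):
    """Return (line, value) pairs for calls passing a suspicious recurrent_dropout.

    `value` is the numeric literal when it is nonzero, or None when the
    argument is not a numeric literal (e.g. a variable) and needs a manual look.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if kw.arg != "recurrent_dropout":
                continue
            value = kw.value
            if isinstance(value, ast.Constant) and isinstance(value.value, (int, float)):
                if value.value != 0:
                    hits.append((node.lineno, value.value))
            else:
                hits.append((node.lineno, None))
    # Sort by line number so the report follows the order of the source.
    return sorted(hits, key=lambda hit: hit[0])


# Made-up sample code, only to exercise the helper.
sample = """
model.add(LSTM(64, recurrent_dropout=0.2))
model.add(LSTM(32, recurrent_dropout=0))
model.add(LSTM(16, recurrent_dropout=rate))
"""

for line, value in find_recurrent_dropout(sample):
    print(line, value)
```

In practice you would read each training script from disk and pass its text to the helper; a plain text search for recurrent_dropout does the same job if you only have a handful of files.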

keras

tensorflow

tensorflow2.0

google-ai-platform