Code changes needed for custom distributed ML Engine Experiment

Question

I completed this tutorial on distributed TensorFlow experiments within an ML Engine experiment, and I am looking to define my own custom tier instead of the STANDARD_1 tier that they use in their config.yaml file. If using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." That would suggest the config.yaml file below is possible:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: complex_model_m
  workerCount: 10
  parameterServerCount: 4

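Are any code changes needed to the mnist tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers, as the tutorial suggests would be possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that if you use the Estimator API (even with your own defined model), no additional changes should be needed.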

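Does any of this change if the config.yaml specifies using GPUs? This article suggests, for the Estimator API, that "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml specifically identifies the machine type for parameter servers and workers, I expect that within ML Engine the ClusterSpec will be properly configured based on the config.yaml file. That said, I am not able to find any ML Engine documentation confirming that no changes are needed to take advantage of GPUs.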

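Last, within ML Engine, I am wondering whether there are any ways to compare how well different configurations are being used. The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." suggests that the benefit of additional workers is roughly linear, but I don't have any intuition for how to determine whether more parameter servers are needed. What could one check (within the cloud dashboards or TensorBoard) to determine whether there is a sufficient number of parameter servers?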

Answer 1

are any additional code changes needed to create a custom tier of any size?

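No; no changes are needed to the MNIST sample to get it to work with different numbers or types of workers. To use a tf.estimator.Estimator on CloudML Engine, your program must invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads the TF_CONFIG environment variable and populates a RunConfig object with the relevant information, such as the ClusterSpec. It will automatically do the right thing on parameter server nodes, and it will use the provided Estimator to start training and evaluation.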

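Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object, whose constructor by default reads TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.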

This all means that you can change your config.yaml by adding or removing workers and/or changing their machine types, and things should generally just work.
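To make the TF_CONFIG mechanism described above more concrete, here is a minimal sketch in plain Python of the kind of information the framework pulls out of that environment variable. It is not the code RunConfig runs internally. The helper name summarize_tf_config and the host names are hypothetical, and the JSON layout (a "cluster" map from job names to host lists, plus a "task" entry) is assumed to match what ML Engine sets for the config.yaml in the question.

```python
import json
import os


def summarize_tf_config(env=None):
    """Return (replica count per job name, this task's type, this task's index)."""
    env = os.environ if env is None else env
    # TF_CONFIG is a JSON string; if it is unset we treat the run as local.
    tf_config = json.loads(env.get("TF_CONFIG") or "{}")
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    counts = {job: len(hosts) for job, hosts in cluster.items()}
    return counts, task.get("type"), task.get("index")


# Hypothetical TF_CONFIG for workerCount: 10 and parameterServerCount: 4,
# as seen by the fourth worker. Host names are made up for illustration.
example = {
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-%d:2222" % i for i in range(10)],
        "ps": ["ps-%d:2222" % i for i in range(4)],
    },
    "task": {"type": "worker", "index": 3},
}

counts, task_type, task_index = summarize_tf_config(
    {"TF_CONFIG": json.dumps(example)}
)
print(counts, task_type, task_index)
```

Because every replica receives its own TF_CONFIG with the same cluster map but a different task entry, the same training code can run unchanged on the master, the workers, and the parameter servers.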

For sample code using a custom model_fn, see the census/customestimator example.

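That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size is 50 and you are using 10 workers, each worker is processing batches of size 50, for an effective batch size of 10*50=500. If you then increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you need to decrease the learning rate accordingly (linear scaling seems to generally work well; ref).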
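The arithmetic above can be sketched as a small helper. The function names and the base learning rate of 0.1 are hypothetical, and the scaling rule is only one reading of "linear" (lowering the rate in inverse proportion to the worker count); whether it suits a given model is something to check empirically.

```python
def effective_batch_size(per_worker_batch_size, num_workers):
    """Each worker processes its own batch, so the sizes add up."""
    return per_worker_batch_size * num_workers


def scaled_learning_rate(base_lr, base_workers, new_workers):
    """Assumed rule: shrink the learning rate in proportion to added workers."""
    return base_lr * base_workers / new_workers


batch_10 = effective_batch_size(50, 10)  # 10 workers, batch_size 50
batch_20 = effective_batch_size(50, 20)  # 20 workers, batch_size 50
# Hypothetical base learning rate of 0.1 that was tuned for 10 workers.
lr_20 = scaled_learning_rate(0.1, base_workers=10, new_workers=20)
print(batch_10, batch_20, lr_20)
```

Treat the scaled value as a starting point for tuning rather than a guarantee.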

I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed?

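No additional configuration is needed. The reddit_tft sample does instantiate its own RunConfig; however, the RunConfig constructor uses TF_CONFIG to fill in any properties that are not explicitly set during instantiation. The sample does this only as a convenience, to figure out how many parameter servers and workers there are.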

Does any of this change if the config.yaml specifies using GPUs?

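You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that is not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.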

google-cloud-platform

tensorflow

google-cloud-ml-engine

tensorflow-gpu