AWS SageMaker | Why does multi-instance training take time multiplied by the number of instances?

Question

I am using AWS SageMaker for model training and deployment. This is a sample of the model training code:

from sagemaker.estimator import Estimator
hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(data_location)

The Docker image referenced here is a TensorFlow setup.

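Suppose training the model takes 1000 seconds. If I now increase the instance count to 5, the training time increases five-fold, to 5000 seconds. As I understand it, the training job should be distributed across the 5 machines, so ideally each machine would take 200 seconds, but it seems that each machine runs its own separate training. Can someone explain how this works on distributed systems in general, or with TensorFlow?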

I tried to find the answer in this document, https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf, but it does not seem to describe how training works across distributed machines.

Answer 1

Are you using the TensorFlow estimator APIs in your script? If yes, I think you should run the script by wrapping it in the sagemaker.tensorflow.TensorFlow class, as described in the documentation. If you run training that way, parallelization and communication between instances should work out of the box.
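For context on why a custom container can end up repeating the whole job on every instance: the same image is started on each instance, and dividing the work is left to the script or framework inside it. Below is a minimal sketch of that idea, assuming the script reads the cluster description from /opt/ml/input/config/resourceconfig.json (its `current_host` and `hosts` keys). The `shard_for_host` helper and the example file names are hypothetical, not part of any SageMaker API.

```python
import json

# Path where the training container can find the cluster description.
RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


def load_resource_config(path=RESOURCE_CONFIG_PATH):
    """Read the JSON cluster description from inside the training container."""
    with open(path) as f:
        return json.load(f)


def shard_for_host(items, resource_config):
    """Hypothetical helper: return the slice of `items` this host should process.

    Hosts are sorted so that every instance derives the same rank ordering.
    """
    hosts = sorted(resource_config["hosts"])
    rank = hosts.index(resource_config["current_host"])
    # Round-robin split: host k takes items k, k + n, k + 2n, ...
    return items[rank::len(hosts)]


# Example with a hand-written config, so it runs outside SageMaker too.
example_config = {"current_host": "algo-2",
                  "hosts": ["algo-1", "algo-2", "algo-3"]}
example_files = ["part-%d.tfrecord" % i for i in range(6)]
my_files = shard_for_host(example_files, example_config)
```

Splitting the input like this only divides the data. Combining what each instance learns into one model still needs a distributed training setup, which is the part the TensorFlow class is meant to handle.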

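Note, however, that scaling will not be linear as you increase the number of instances. Communication between instances takes time, and the script may contain bottlenecks that cannot be parallelized, such as loading data into memory.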
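A rough way to see the effect is an Amdahl's-law style estimate. The sketch below is illustrative only: the parallel fraction and the per-instance communication cost are made-up assumptions, not measurements from SageMaker or TensorFlow.

```python
def estimated_time(single_instance_seconds, instances,
                   parallel_fraction, comm_overhead_seconds=0.0):
    """Estimate wall-clock time when only part of the job can be parallelized.

    parallel_fraction: assumed share of the work that splits across instances.
    comm_overhead_seconds: assumed extra cost per additional instance.
    """
    serial = single_instance_seconds * (1.0 - parallel_fraction)
    parallel = single_instance_seconds * parallel_fraction / instances
    return serial + parallel + comm_overhead_seconds * (instances - 1)


# Ideal case from the question: everything parallelizes, no overhead.
ideal = estimated_time(1000, 5, parallel_fraction=1.0)

# Assumed case: 10% of the job is serial (e.g. data loading), 5 s overhead per extra instance.
realistic = estimated_time(1000, 5, parallel_fraction=0.9,
                           comm_overhead_seconds=5)

print(ideal, realistic)
```

Under these assumed numbers the 5-instance run lands around 300 seconds rather than the ideal 200, which is the kind of gap the answer describes.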

python

tensorflow

amazon-sagemaker