How to overcome TrainingException when training a large model with Azure Machine Learning service?

Question

I am training a fairly large model and am trying to use Azure Machine Learning service from Azure Notebooks for it.

So I create an Estimator to train locally:

from azureml.train.estimator import Estimator

estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py')

(My train.py is supposed to load a large word vector file and train starting from it.)

When I run

run = experiment.submit(config=estimator)

I get

TrainingException:

====================================================================

While attempting to take snapshot of /data/home/username/notebooks/source_dir Your total snapshot size exceeds the limit of 300.0 MB. Please see http://aka.ms/aml-largefiles on how to work with large files.

====================================================================

The link given in the error appears to be broken. The contents of my ./source_dir do indeed exceed 300 MB.
How can I solve this?

Answer 1

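You can place the training files outside source_dir so that they are not uploaded as part of the snapshot when you submit the experiment, and then upload them separately to the datastore (which is essentially the Azure storage account associated with your workspace). All you need to do then is reference the training files from train.py.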
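A rough sketch of the notebook side of this, under some assumptions: `ws` is your Workspace object, the data lives in a hypothetical `./data` folder outside `./source_dir`, and `word_vectors` is a made-up target path on the datastore. The `get_default_datastore`, `upload`, `path(...).as_mount()` and `script_params` calls are the azureml SDK v1 ones as I remember them from the tutorial, so check them against your SDK version. The compute target is left as a parameter because nothing above confirms that mounting works with 'local'. The size helper assumes 1 MB = 1024 * 1024 bytes, which may not match how the service counts.

```python
import os

SNAPSHOT_LIMIT_MB = 300.0  # limit quoted in the error message


def dir_size_mb(path):
    """Total size of all regular files under path, in MB (assuming 1 MB = 1024 * 1024 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / (1024 * 1024)


def build_estimator(ws, compute_target, data_dir='./data', target_path='word_vectors'):
    """Upload data_dir to the default datastore and build an Estimator that points at it.

    data_dir and target_path are hypothetical names; data_dir must live
    outside ./source_dir so that it is not part of the snapshot.
    """
    # Imported here so the size check above also works without the SDK installed.
    from azureml.train.estimator import Estimator

    # Fail early if source_dir would still trip the snapshot limit.
    if dir_size_mb('./source_dir') > SNAPSHOT_LIMIT_MB:
        raise ValueError('source_dir is still above the snapshot limit')

    # Upload the large files separately from the snapshot.
    ds = ws.get_default_datastore()
    ds.upload(src_dir=data_dir, target_path=target_path,
              overwrite=True, show_progress=True)

    # Pass the datastore location to train.py as a command-line argument.
    return Estimator(source_directory='./source_dir',
                     script_params={'--data-folder': ds.path(target_path).as_mount()},
                     compute_target=compute_target,
                     entry_script='train.py')
```

You would then submit with something like `run = experiment.submit(config=build_estimator(ws, compute_target))`. Running `dir_size_mb('./source_dir')` on its own is also a quick way to see how far over the limit you are before moving files.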

See the Train model tutorial for an example of how to upload data to the datastore and then access it from the training script.
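On the train.py side, the script only needs to read the data location from its arguments instead of from source_dir. A minimal sketch, assuming the location arrives as a `--data-folder` argument and that the word vectors are a plain text file with one `word v1 v2 ...` entry per line and no header line. The file name `vectors.txt` and the function names are hypothetical, so adapt them to your actual file.

```python
import argparse
import os


def parse_args(argv=None):
    """Read the data folder passed in by the estimator's script parameters."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', dest='data_folder', required=True,
                        help='folder where the datastore content is available')
    return parser.parse_args(argv)


def load_word_vectors(path):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def main(argv=None):
    args = parse_args(argv)
    # vectors.txt is a hypothetical file name inside the uploaded folder.
    vectors = load_word_vectors(os.path.join(args.data_folder, 'vectors.txt'))
    print('loaded', len(vectors), 'word vectors')
    # ... build and train the model starting from `vectors` ...
    return vectors
```

The real train.py would end with a call to `main()`. Since the path comes in as an argument, the same script can be tried locally with `python train.py --data-folder ./data` before submitting it.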

Answer 2

After reading the GitHub issue "Encounter |total Snapshot size 300MB while start logging" for Azure ML service and the official document "Manage and request quotas for Azure resources", I think this is an unresolved issue that will take some time for Azure to fix.

In the meantime, I suggest trying to migrate your current work to another service, Azure Databricks: upload your dataset and code there and run them in an Azure Databricks notebook, which runs on a Spark cluster, so you do not hit this snapshot size limit. You can refer to these samples.

python

azure

azure-machine-learning-service