Training Job Is Stopping in SageMaker

I recently switched accounts on AWS and ran into a strange error in SageMaker.

Basically, I am just testing the XGBoost algorithm on a toy dataset, like this:

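import boto3
import sagemaker
from sagemaker import image_uris

# `session`, `role`, and `location_data` are defined earlier (not shown here).
# Retrieve the built-in XGBoost container image for the current region.
xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")

clf = sagemaker.estimator.Estimator(
    xgb_image_uri,
    role,
    instance_count=1,
    instance_type="ml.c4.2xlarge",
    output_path="s3://{}/output".format(session.default_bucket()),
    sagemaker_session=session,
)

# Start the training job on the input data location.
clf.fit(location_data)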

The training job then starts, but for some reason it is stopped during the data download step, with the following message:

2021-10-21 17:33:27 Downloading - Downloading input data
2021-10-21 17:33:27 Stopping - Stopping the training job
2021-10-21 17:33:27 Stopped - Training job stopped
ProfilerReport-1634837444: Stopping
..
Job ended with status 'Stopped' rather than 'Completed'. This could mean the job timed out or stopped early for some other reason: Consider checking whether it completed as you expect.

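Also, when I go back to the Training jobs section and check the logs in CloudWatch, nothing is shown. Is this a common problem? Has anyone else run into it? Is there a workaround?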

The problem is most likely related to the SageMaker template that runs before the instance is created.
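Since CloudWatch showed no logs for the job, one way to get more detail might be to ask the SageMaker API for the job description directly. The following is only a minimal sketch, not a confirmed fix: the helper names `summarize_training_job` and `describe_job` are made up for illustration, and it assumes the caller has permission to call `DescribeTrainingJob` in the same account and region as the job.

```python
from typing import Any, Dict, List


def summarize_training_job(desc: Dict[str, Any]) -> List[str]:
    """Turn a DescribeTrainingJob response into a few readable lines.

    Hypothetical helper for illustration; it only reads fields that the
    DescribeTrainingJob response is documented to contain.
    """
    lines = [
        "status: {} / {}".format(
            desc.get("TrainingJobStatus", "?"), desc.get("SecondaryStatus", "?")
        )
    ]
    # FailureReason is only present when the job actually failed.
    if desc.get("FailureReason"):
        lines.append("failure reason: " + desc["FailureReason"])
    # A very small MaxRuntimeInSeconds would explain an early stop.
    max_run = desc.get("StoppingCondition", {}).get("MaxRuntimeInSeconds")
    if max_run is not None:
        lines.append("max runtime (s): {}".format(max_run))
    # Each transition records a status and an optional message.
    for transition in desc.get("SecondaryStatusTransitions", []):
        lines.append(
            "{}: {}".format(
                transition.get("Status", "?"), transition.get("StatusMessage", "")
            )
        )
    return lines


def describe_job(job_name: str) -> List[str]:
    """Fetch and summarize a training job (hypothetical wrapper)."""
    # Imported lazily so the summarizer can be used without AWS access.
    import boto3

    client = boto3.client("sagemaker")
    return summarize_training_job(
        client.describe_training_job(TrainingJobName=job_name)
    )
```

With the estimator from the question, this could be called as `print("\n".join(describe_job(clf.latest_training_job.name)))`. If the output shows neither a failure reason nor a short runtime limit, the stop was probably requested from outside the job, which would be consistent with something in the account setup stopping it.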