Training Job is Stopping in SageMaker
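I recently switched to a different AWS account and started getting a strange error in SageMaker.
Basically, I am just trying out the xgboost algorithm on a toy dataset, like this: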
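import boto3
import sagemaker
from sagemaker import image_uris
# Assumed setup: session and role are not defined in the original snippet
session = sagemaker.Session()
role = sagemaker.get_execution_role()
xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")
clf = sagemaker.estimator.Estimator(xgb_image_uri,
    role, instance_count=1, instance_type='ml.c4.2xlarge',
    output_path="s3://{}/output".format(session.default_bucket()),
    sagemaker_session=session)
# location_data is the training input (e.g. an S3 URI), defined elsewhere
clf.fit(location_data)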
The training job then starts, but for some reason it is stopped during the data download step, with the following messages:
2021-10-21 17:33:27 Downloading - Downloading input data
2021-10-21 17:33:27 Stopping - Stopping the training job
2021-10-21 17:33:27 Stopped - Training job stopped
ProfilerReport-1634837444: Stopping
..
Job ended with status 'Stopped' rather than 'Completed'. This could mean the job timed out or stopped early for some other reason: Consider checking whether it completed as you expect.
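Also, when I go back to the Training jobs section and check the logs in CloudWatch, nothing is shown. Is this a common problem? Has anyone run into it? Is there a workaround?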
The problem is most likely related to the SageMaker template that runs before the instance is created.
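When CloudWatch shows no logs, the job description itself may still say why the job stopped. The sketch below is only an illustration, not a confirmed fix: it calls boto3's `describe_training_job` and prints the status fields from the response. The helper names `summarize_training_job` and `describe_job` are made up for this example.

```python
def summarize_training_job(desc):
    """Turn a DescribeTrainingJob response dict into readable lines."""
    lines = [
        "status: {} / {}".format(
            desc.get("TrainingJobStatus"), desc.get("SecondaryStatus")
        )
    ]
    # FailureReason is only present when the job failed
    if desc.get("FailureReason"):
        lines.append("failure reason: {}".format(desc["FailureReason"]))
    # Each transition records a phase of the job and its status message
    for step in desc.get("SecondaryStatusTransitions", []):
        lines.append(
            "{}: {}".format(step.get("Status"), step.get("StatusMessage", ""))
        )
    return lines


def describe_job(job_name, region_name=None):
    """Fetch the job description from SageMaker (needs AWS credentials)."""
    import boto3  # imported here so the summary helper works without boto3

    client = boto3.client("sagemaker", region_name=region_name)
    return client.describe_training_job(TrainingJobName=job_name)
```

With the estimator from the question, this could be called as `print("\n".join(summarize_training_job(describe_job(clf.latest_training_job.name))))`, assuming `clf.fit` has already been run in the same session.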