SageMaker in local Jupyter notebook: cannot use AWS hosted XGBoost container ("KeyError: 'S3DistributionType'" and "Failed to run: ['docker-compose'")
Running SageMaker from a local Jupyter notebook (in VS Code) works fine, except that attempting to train an XGBoost model with the AWS-hosted container fails with errors (container name: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3).
Jupyter notebook
import sagemaker
session = sagemaker.LocalSession()
# Load and prepare the training and validation data
...
# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)
role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'
xgb_estimator = sagemaker.estimator.Estimator(
container, role, train_instance_count=1, train_instance_type=instance_type,
output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)
xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6,
subsample=0.8, objective='reg:squarederror', early_stopping_rounds=10,
num_round=200)
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')
xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})
Docker container KeyError
algo-1-tfcvc_1 | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1 | ERROR:sagemaker-containers:framework error:
algo-1-tfcvc_1 | Traceback (most recent call last):
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1 | entrypoint()
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1 | train(framework.training_env())
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1 | run_algorithm_mode()
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1 | checkpoint_config=checkpoint_config
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1 | validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1 | channel_obj.validate(value)
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1 | if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1 | KeyError: 'S3DistributionType'
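The traceback shows the failure mechanism: the container's channel validator indexes the data config with three keys at once, and the config generated in local mode lacks the `S3DistributionType` key. A minimal sketch of that lookup (the dict values here are illustrative assumptions, not taken from the actual job config):

```python
# Channel config as the hosted XGBoost container expects it (assumption:
# SageMaker's managed training service injects S3DistributionType).
hosted_config = {
    "ContentType": "csv",
    "TrainingInputMode": "File",
    "S3DistributionType": "FullyReplicated",
}

# Local mode builds the channel config without S3DistributionType.
local_config = {
    "ContentType": "csv",
    "TrainingInputMode": "File",
}

def validate(value):
    # Mirrors the three-key tuple lookup in channel_validation.py, line 52.
    return (value["ContentType"], value["TrainingInputMode"], value["S3DistributionType"])

validate(hosted_config)          # succeeds
try:
    validate(local_config)
except KeyError as e:
    print(e)                     # prints 'S3DistributionType'
```

The same lookup that succeeds against a hosted-training config raises the `KeyError: 'S3DistributionType'` seen in the container log when fed the local-mode config.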
RuntimeError on the local PC
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
There are no errors if the Jupyter notebook is run in an Amazon cloud SageMaker environment (rather than on the local PC). Note that when running in the cloud notebook, the session is initialized with:
session = sagemaker.Session()
There appears to be a problem with how LocalSession() interacts with the hosted docker containers.
When SageMaker runs in a local Jupyter notebook, it expects the Docker containers to be running on the local machine as well.
The key to making SageMaker (running in a local notebook) use the AWS-hosted docker containers is to omit the sagemaker_session parameter when initializing the Estimator.
Wrong
xgb_estimator = sagemaker.estimator.Estimator(
container, role, train_instance_count=1, train_instance_type=instance_type,
output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)
Correct
xgb_estimator = sagemaker.estimator.Estimator(
container, role, train_instance_count=1, train_instance_type=instance_type,
output_path=f's3://{session.default_bucket()}/{prefix}/output')
Additional information
The SageMaker Python SDK source code offers the following helpful hints:
File: sagemaker/local/local_session.py
class LocalSagemakerClient(object):
"""A SageMakerClient that implements the API calls locally.
Used for doing local training and hosting local endpoints. It still needs access to
a boto client to interact with S3 but it won't perform any SageMaker call.
...
File: sagemaker/estimator.py
class EstimatorBase(with_metaclass(ABCMeta, object)):
"""Handle end-to-end Amazon SageMaker training and deployment tasks.
For introduction to model training and deployment, see
http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
Subclasses must define a way to determine what image to use for training,
what hyperparameters to use, and how to create an appropriate predictor instance.
"""
def __init__(self, role, train_instance_count, train_instance_type,
train_volume_size=30, train_max_run=24 * 60 * 60, input_mode='File',
output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None):
"""Initialize an ``EstimatorBase`` instance.
Args:
role (str): An AWS IAM role (either name or full ARN). ...
...
sagemaker_session (sagemaker.session.Session): Session object which manages interactions with
Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one
using the default AWS configuration chain.
"""