SageMaker in local Jupyter notebook: cannot use AWS hosted XGBoost container ("KeyError: 'S3DistributionType'" and "Failed to run: ['docker-compose'")

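Running SageMaker in a local Jupyter notebook (using VS Code) works without issue, except that attempting to train an XGBoost model with the AWS hosted container results in errors (container name: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3).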

Jupyter notebook

import os

import sagemaker

session = sagemaker.LocalSession()

# Load and prepare the training and validation data
...

# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)

role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)

xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6,
                                  subsample=0.8, objective='reg:squarederror', early_stopping_rounds=10,
                                  num_round=200)

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})

Docker container KeyError

algo-1-tfcvc_1  | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1  | ERROR:sagemaker-containers:framework error: 
algo-1-tfcvc_1  | Traceback (most recent call last):
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1  |     entrypoint()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1  |     train(framework.training_env())
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1  |     run_algorithm_mode()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1  |     checkpoint_config=checkpoint_config
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1  |     validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1  |     channel_obj.validate(value)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1  |     if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1  | KeyError: 'S3DistributionType'
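
The last frame of the traceback builds a tuple from three entries of the channel description, and the lookup of 'S3DistributionType' is the one that fails. The lookup can be reproduced in isolation with the rough sketch below. Only the key 'S3DistributionType' is taken from the traceback; the other two key strings, the sample values, and the helper name channel_signature are assumptions made for illustration, not the container's actual code.

```python
# Key names: only "S3DistributionType" appears in the traceback above.
# "ContentType" and "TrainingInputMode" are assumed for illustration.
CONTENT_TYPE = "ContentType"
TRAINING_INPUT_MODE = "TrainingInputMode"
S3_DIST_TYPE = "S3DistributionType"


def channel_signature(value):
    """Build the tuple that the failing line compares against its supported set."""
    return (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE])


# Hypothetical channel description that lacks the distribution type.
incomplete_channel = {CONTENT_TYPE: "csv", TRAINING_INPUT_MODE: "File"}

try:
    channel_signature(incomplete_channel)
    missing_key = None
except KeyError as err:
    # A missing dictionary key surfaces exactly like the container error.
    missing_key = err.args[0]

print(missing_key)
```

If the channel description handed to the container omits that one entry, the validation raises before training starts, which would match the log above.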

Local PC RuntimeError

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

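If the Jupyter notebook is run in the AWS cloud SageMaker environment (instead of on the local PC), there are no errors. Note that when running in the cloud notebook, the session is initialized as: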

session = sagemaker.Session()

There appears to be an issue with how LocalSession() works with the hosted Docker container.

When SageMaker runs in a local Jupyter notebook with a LocalSession, it expects the Docker container to run on the local machine as well.

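The key to ensuring that SageMaker (running in a local notebook) uses the AWS hosted Docker container is to omit the LocalSession object when initializing the Estimator.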

Wrong

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)

Correct

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output')
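
One way to keep both variants in a single notebook is to decide once whether a session is passed at all. The following is a minimal sketch under that assumption; the helper name estimator_session_kwargs and its flag are hypothetical and not part of the SageMaker SDK.

```python
def estimator_session_kwargs(use_local_mode):
    """Return the extra keyword arguments to pass to Estimator(...).

    Hypothetical helper, not part of the SageMaker SDK.
    """
    if use_local_mode:
        # Imported lazily so the cloud path does not need the local-mode extras.
        from sagemaker.local import LocalSession
        return {"sagemaker_session": LocalSession()}
    # No session argument: the estimator falls back to its default session,
    # so the training job is submitted to AWS.
    return {}


print(estimator_session_kwargs(use_local_mode=False))
```

The result can then be unpacked into the constructor call, for example Estimator(container, role, ..., **estimator_session_kwargs(False)), so that switching between local and hosted training is a one-flag change.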

Additional info

The SageMaker Python SDK source code provides the following helpful hints:

File: sagemaker/local/local_session.py

class LocalSagemakerClient(object):
    """A SageMakerClient that implements the API calls locally.

    Used for doing local training and hosting local endpoints. It still needs access to
    a boto client to interact with S3 but it won't perform any SageMaker call.
    ...

File: sagemaker/estimator.py

class EstimatorBase(with_metaclass(ABCMeta, object)):
    """Handle end-to-end Amazon SageMaker training and deployment tasks.

    For introduction to model training and deployment, see
    http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

    Subclasses must define a way to determine what image to use for training,
    what hyperparameters to use, and how to create an appropriate predictor instance.
    """

    def __init__(self, role, train_instance_count, train_instance_type,
                 train_volume_size=30, train_max_run=24 * 60 * 60, input_mode='File',
                 output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None):
        """Initialize an ``EstimatorBase`` instance.

        Args:
            role (str): An AWS IAM role (either name or full ARN). ...
            
        ...

            sagemaker_session (sagemaker.session.Session): Session object which manages interactions with
                Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one
                using the default AWS configuration chain.
        """