SageMaker in local Jupyter notebook: cannot use AWS hosted XGBoost container ("KeyError: 'S3DistributionType'" and "Failed to run: ['docker-compose'")

Question

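Running SageMaker from a local Jupyter notebook (using VS Code) works without issue, except that attempting to train an XGBoost model using the AWS hosted container results in errors (container name: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3).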

Jupyter Notebook

import os

import sagemaker

session = sagemaker.LocalSession()

# Load and prepare the training and validation data
...

# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)

role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)

xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6,
                                  subsample=0.8, objective='reg:squarederror', early_stopping_rounds=10,
                                  num_round=200)

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})

Docker Container KeyError

algo-1-tfcvc_1  | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1  | ERROR:sagemaker-containers:framework error: 
algo-1-tfcvc_1  | Traceback (most recent call last):
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1  |     entrypoint()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1  |     train(framework.training_env())
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1  |     run_algorithm_mode()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1  |     checkpoint_config=checkpoint_config
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1  |     validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1  |     channel_obj.validate(value)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1  |     if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1  | KeyError: 'S3DistributionType'

Local PC RuntimeError

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

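If the Jupyter notebook is run in the AWS cloud SageMaker environment (instead of on a local PC), there are no errors. Note that when running in the cloud notebook, the session is initialized as: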

session = sagemaker.Session()

It appears that there is an issue with how LocalSession() works with the hosted Docker container.

Answer 1

When running SageMaker in a local Jupyter notebook, it expects the Docker container to be running on the local machine as well.
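
The RuntimeError in the question has the shape of a wrapper around a failed child process: docker-compose is started with --abort-on-container-exit, so when the training container stops with an error, the compose command exits non-zero and the caller reports it. The following is only a minimal sketch of that pattern, not the SDK's actual implementation; the helper name run_or_raise and the stand-in failing command are hypothetical.

```python
import subprocess
import sys


def run_or_raise(cmd):
    """Run a command and raise RuntimeError if it exits with a non-zero code."""
    completed = subprocess.run(cmd)
    if completed.returncode != 0:
        # Message format chosen to mirror the RuntimeError shown in the question.
        raise RuntimeError(
            f"Failed to run: {cmd}, Process exited with code: {completed.returncode}"
        )
    return completed.returncode


# Stand-in for a failing container: a child Python process that exits with code 1.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

message = None
try:
    run_or_raise(failing_cmd)
except RuntimeError as err:
    message = str(err)
    print(message)
```

Read this way, the RuntimeError is likely a downstream symptom, and the KeyError raised inside the container is the error to investigate first.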

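The key to ensuring that SageMaker (running in a local notebook) uses the AWS hosted Docker container is to omit the LocalSession object, that is, the sagemaker_session argument, when initializing the Estimator.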

Wrong

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)

Correct

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output')
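
Note that the corrected call still uses the session object for session.default_bucket(); only the sagemaker_session argument to the Estimator is dropped. A minimal sketch of keeping that choice explicit is shown here. The helper name build_estimator_kwargs and the example bucket path are hypothetical, and the keyword names are copied from the snippets above.

```python
def build_estimator_kwargs(output_path, instance_type, local_session=None):
    """Return keyword arguments for an Estimator.

    sagemaker_session is included only when a session is passed explicitly,
    so hosted training falls back to the estimator's default session.
    """
    kwargs = {
        "train_instance_count": 1,
        "train_instance_type": instance_type,
        "output_path": output_path,
    }
    if local_session is not None:
        kwargs["sagemaker_session"] = local_session
    return kwargs


# Hosted training: no session is passed, so the key is absent.
hosted_kwargs = build_estimator_kwargs(
    "s3://example-bucket/example-prefix/output",  # hypothetical path
    "ml.m4.xlarge",
)
print(sorted(hosted_kwargs))

# The Estimator would then be created as in the "Correct" snippet:
# xgb_estimator = sagemaker.estimator.Estimator(container, role, **hosted_kwargs)
```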

Additional info

The SageMaker Python SDK source code provides the following helpful hints:

File: sagemaker/local/local_session.py

class LocalSagemakerClient(object):
    """A SageMakerClient that implements the API calls locally.

    Used for doing local training and hosting local endpoints. It still needs access to
    a boto client to interact with S3 but it won't perform any SageMaker call.
    ...
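
The KeyError in the question shows that the channel dictionary handed to the container's validator had no 'S3DistributionType' entry. One plausible reading, not confirmed by the source, is that the configuration generated locally (rather than by the SageMaker service) omits that key. The sketch below reproduces only the lookup from the failing traceback line; 'S3DistributionType' comes from the error message, while the other two key strings, the supported set, and the sample channel dictionaries are assumptions made for illustration.

```python
# 'S3DistributionType' is taken from the KeyError in the traceback;
# the other two string values are assumptions for illustration.
CONTENT_TYPE = "ContentType"
TRAINING_INPUT_MODE = "TrainingInputMode"
S3_DIST_TYPE = "S3DistributionType"

# Hypothetical set of supported (content type, input mode, distribution) triples.
SUPPORTED = {("csv", "File", "FullyReplicated"), ("csv", "File", "ShardedByS3Key")}


def validate_channel(value, supported=SUPPORTED):
    """Mimic the tuple lookup shown on the failing line of the traceback."""
    triple = (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE])
    if triple not in supported:
        raise ValueError(f"Unsupported channel configuration: {triple}")
    return value


# Hypothetical channel config that carries all three keys.
hosted_channel = {
    CONTENT_TYPE: "csv",
    TRAINING_INPUT_MODE: "File",
    S3_DIST_TYPE: "FullyReplicated",
}

# Hypothetical channel config with the distribution key missing.
local_channel = {CONTENT_TYPE: "csv", TRAINING_INPUT_MODE: "File"}

validate_channel(hosted_channel)  # passes without raising

missing_key = None
try:
    validate_channel(local_channel)
except KeyError as err:
    missing_key = err.args[0]
    print(f"KeyError: {missing_key!r}")
```

Because the lookup uses plain indexing, a missing key fails with KeyError before the supported-combination check is ever reached.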

File: sagemaker/estimator.py

class EstimatorBase(with_metaclass(ABCMeta, object)):
    """Handle end-to-end Amazon SageMaker training and deployment tasks.

    For introduction to model training and deployment, see
    http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

    Subclasses must define a way to determine what image to use for training,
    what hyperparameters to use, and how to create an appropriate predictor instance.
    """

    def __init__(self, role, train_instance_count, train_instance_type,
                 train_volume_size=30, train_max_run=24 * 60 * 60, input_mode='File',
                 output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None):
        """Initialize an ``EstimatorBase`` instance.

        Args:
            role (str): An AWS IAM role (either name or full ARN). ...
            
        ...

            sagemaker_session (sagemaker.session.Session): Session object which manages interactions with
                Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one
                using the default AWS configuration chain.
        """


