How to use SageMaker Estimator for model training and saving
Documentation on how to use the SageMaker Estimator is scattered all over the place, and is sometimes outdated or even incorrect. Is there a one-stop location that comprehensively explains how to train and save models with the SageMaker SDK Estimator?
Answer
There is no such resource in AWS that comprehensively explains how to train and save models with the SageMaker SDK Estimator.
Alternative overview diagram
I put together a diagram with a brief explanation to give a big-picture view of how the SageMaker Estimator runs training.
SageMaker sets up a docker container for the training job, in which:
- Environment variables are set as listed in SageMaker Docker Container Environment Variables.
- Training data is placed under /opt/ml/input/data.
- Training script code is placed under /opt/ml/code.
- The /opt/ml/model and /opt/ml/output directories are set up to store the training outputs.
/opt/ml
├── input
│ ├── config
│ │ ├── hyperparameters.json <--- From Estimator hyperparameter arg
│ │ └── resourceConfig.json
│ └── data
│ └── <channel_name> <--- From Estimator fit method inputs arg
│ └── <input data>
├── code
│ └── <code files> <--- From Estimator src_dir arg
├── model
│ └── <model files> <--- Location to save the trained model artifacts
└── output
└── failure <--- Training job failure logs
The SageMaker Estimator fit(inputs) method executes the training script. The Estimator hyperparameters are passed to the script as command-line arguments, and the fit method inputs become the channel directories under /opt/ml/input/data (exposed via the SM_CHANNEL_* environment variables).
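For example, with hyperparameters={"epochs": 2, "batch-size": 64} the script is invoked with --epochs 2 --batch-size 64, which it can parse with argparse. A minimal sketch (the defaults here are illustrative):

# Minimal sketch: Estimator hyperparameters arrive as command-line
# arguments, e.g. "--epochs 2 --batch-size 64".
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=64)  # dest becomes batch_size
args = parser.parse_args()

print(args.epochs, args.batch_size)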
When training completes, the training script must have saved the model artifacts under /opt/ml/model.
SageMaker archives the artifacts under /opt/ml/model into model.tar.gz and saves it to the S3 location specified by the output_path Estimator argument.
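After fit() returns, the S3 URI of that model.tar.gz can be read back from the SageMaker SDK, assuming estimator is the fitted Estimator object:

# S3 location of the model.tar.gz archived from /opt/ml/model:
print(estimator.model_data)
# e.g. s3://<bucket>/<output_path-prefix>/<training-job-name>/output/model.tar.gz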
You can set the Estimator metric_definitions argument to extract model metrics from the training logs. You can then monitor the training progress in the SageMaker console metrics.
I think AWS needs to stop mass-producing lengthy, redundant, scattered, and outdated documents. AWS needs to understand that a picture is worth a thousand words.
Put the diagrams and documentation pieces together in context, and make the objective and how to achieve it explicit.
Problem
AWS documentation needs a serious redesign and restructuring. Just to understand how to train and save a model, we are forced to read piles of scattered, fragmented, verbose, redundant documents that are often outdated, incomplete, and sometimes simply incorrect.
Why I think GCP is better than AWS sums it up well:
It’s not that AWS is harder to use than GCP, it’s that it is needlessly hard; a disjointed, sprawl of infrastructure primitives with poor cohesion between them.
A challenge is nice, a confusing mess is not, and the problem with AWS is that a large part of your working hours will be spent untangling their documentation and weeding through features and products to find what you want, rather than focusing on cool interesting challenges.
Watch AI Simplified and see how simple and intuitive the GCP AI stack is, and how ugly SageMaker looks in comparison. I strongly recommend moving to GCP and away from SageMaker. A technology that AWS cannot document itself has no future.
The SageMaker team in particular keeps changing the implementation without updating the documentation. Its rollouts are inconsistent too: for example, SDK version 2 was rolled out in SageMaker Studio, which made the AWS examples on GitHub incompatible without any announcement, while SageMaker notebook instances still had SDK 1, so code worked in an instance but not in Studio.
It is unbelievable, even insane, that we have to read this many documents (below) just to understand how to train with the SageMaker SDK Estimator. How much developer time does AWS want to waste?
Model training documentation
This document gives a 20,000-foot overview of how SageMaker training works, but offers no clue about what to actually do.
This document outlines what SageMaker training looks like. However, it is not up to date, because it is based on the deprecated SageMaker Containers.
WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.
This document lists the training steps.
The Amazon SageMaker Python SDK provides framework estimators and generic estimators to train your model while orchestrating the machine learning (ML) lifecycle accessing the SageMaker features for training and the AWS infrastructures
To train a model by using the SageMaker Python SDK, you:
- Prepare a training script
- Create an estimator
- Call the fit method of the estimator
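A minimal sketch of those three steps with a TensorFlow framework estimator (the role ARN and S3 paths below are placeholders):

from sagemaker.tensorflow import TensorFlow

# 2. Create an estimator (1. train.py is the prepared training script).
estimator = TensorFlow(
    entry_point="train.py",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.3.1",
    py_version="py37",
)

# 3. Call the fit method of the estimator.
estimator.fit("s3://<bucket>/<training-data-prefix>/")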
This last document finally gives concrete steps and ideas. However, it still lacks comprehensive details on the environment variables, the directory structure inside the SageMaker docker container, the S3 locations used to upload code, place data, and save the trained model, and so on.
This document focuses on the TensorFlow Estimator implementation steps. Use the Training a Tensorflow Model on MNIST GitHub example to follow along with an actual implementation.
Documentation for passing parameters and data locations
This section explains how SageMaker makes training information, such as training data, hyperparameters, and other configuration information, available to your Docker container.
This document finally gives an idea of how parameters and data are passed, but again it is not comprehensive.
The following document is flagged as deprecated, but it is the only one that explains the SageMaker environment variables.
IMPORTANT ENVIRONMENT VARIABLES
- SM_MODEL_DIR
- SM_CHANNELS
- SM_CHANNEL_{channel_name}
- SM_HPS
- SM_HP_{hyperparameter_name}
- SM_CURRENT_HOST
- SM_HOSTS
- SM_NUM_GPUS
List of provided environment variables by SageMaker Containers
- SM_NUM_CPUS
- SM_LOG_LEVEL
- SM_NETWORK_INTERFACE_NAME
- SM_USER_ARGS
- SM_INPUT_DIR
- SM_INPUT_CONFIG_DIR
- SM_OUTPUT_DATA_DIR
- SM_RESOURCE_CONFIG
- SM_INPUT_DATA_CONFIG
- SM_TRAINING_ENV
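Inside the container, a training script can read these variables straight from the environment. A minimal sketch (the channel name "training" is an assumption; SM_CHANNEL_* names match whatever channel keys you pass to fit):

import json
import os

model_dir = os.environ["SM_MODEL_DIR"]              # /opt/ml/model
hyperparameters = json.loads(os.environ["SM_HPS"])  # e.g. {"epochs": 2}
channels = json.loads(os.environ["SM_CHANNELS"])    # e.g. ["training"]

# One SM_CHANNEL_<NAME> variable per fit() channel (assumed name "training"):
train_dir = os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")

print(model_dir, hyperparameters, channels, train_dir)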
SageMaker documentation on the Docker container directory structure
/opt/ml
├── input
│ ├── config
│ │ ├── hyperparameters.json
│ │ └── resourceConfig.json
│ └── data
│ └── <channel_name>
│ └── <input data>
├── model
│ └── <model files>
└── output
└── failure
This document explains the directory structure and what each directory is for.
The input
- /opt/ml/input/config contains information to control how your program runs. hyperparameters.json is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. resourceConfig.json is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn’t support distributed training, we’ll ignore it here.
- /opt/ml/input/data/<channel_name>/ (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it’s generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
- /opt/ml/input/data/<channel_name>_<epoch_number> (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
The output
- /opt/ml/model/ is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the DescribeTrainingJob result.
- /opt/ml/output is a directory where the algorithm can write a file failure that describes why the job failed. The contents of this file will be returned in the FailureReason field of the DescribeTrainingJob result. For jobs that succeed, there is no reason to write this file as it will be ignored.
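As an illustration of the failure file, a training entry point might write the reason before exiting, roughly like this sketch:

import sys
import traceback

def main():
    ...  # training logic goes here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # The contents of this file are returned in the FailureReason
        # field of the DescribeTrainingJob result.
        with open("/opt/ml/output/failure", "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)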
Again, though, this document is not up to date, because it is based on the deprecated SageMaker Containers.
Model saving documentation
Information on where and in what format to save the trained model is simply missing. The training script needs to save the model under /opt/ml/model, and the format and sub-directory structure depend on the framework, e.g. TensorFlow or PyTorch. This is because SageMaker deployment uses framework-dependent model serving, e.g. TensorFlow Serving for the TensorFlow framework.
None of this is clearly documented, which causes confusion. Developers need to work out which format to use and which sub-directory to save into.
To train and deploy with the TensorFlow Estimator:
Because we’re using TensorFlow Serving for deployment, our training script saves the model in TensorFlow’s SavedModel format.
# Save the model
# A version number is needed for the serving container
# to load the model
version = "00000000"
ckpt_dir = os.path.join(args.model_dir, version)
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)
model.save(ckpt_dir)
The code saves the model under /opt/ml/model/00000000 because that is the layout TensorFlow Serving expects.
The save-path follows a convention used by TensorFlow Serving where the last path component (1/ here) is a version number for your model - it allows tools like Tensorflow Serving to reason about the relative freshness.
To load our trained model into TensorFlow Serving we first need to save it in SavedModel format. This will create a protobuf file in a well-defined directory hierarchy, and will include a version number. TensorFlow Serving allows us to select which version of a model, or "servable" we want to use when we make inference requests. Each version will be exported to a different sub-directory under the given path.
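Once the model is saved in that layout, deployment can go straight from the estimator to a TensorFlow Serving endpoint. A minimal sketch, assuming estimator is a fitted TensorFlow estimator and the instance type is an example value:

# Deploy the trained model to a TensorFlow Serving endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
# predictor.predict(...) then runs inference against the SavedModel
# version saved under /opt/ml/model/00000000.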
API documentation
Essentially, the SageMaker SDK Estimator implements the CreateTrainingJob API for the training part. Hence it is better to understand how that API is designed and which parameters need to be defined. Otherwise, using the Estimator is like walking in the dark.
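For example, the raw view of a finished job can be inspected through the DescribeTrainingJob API with boto3 (the job name is taken from the run below):

import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(
    TrainingJobName="fashion-mnist-2021-09-03-03-02-02-305"
)

print(job["TrainingJobStatus"])
# S3 location of the model.tar.gz archived from /opt/ml/model:
print(job["ModelArtifacts"]["S3ModelArtifacts"])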
Example
Jupyter notebook
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
metric_definitions = [
    {"Name": "train:loss", "Regex": r".*loss: ([0-9\.]+) - accuracy: [0-9\.]+.*"},
    {"Name": "train:accuracy", "Regex": r".*loss: [0-9\.]+ - accuracy: ([0-9\.]+).*"},
    {
        "Name": "validation:accuracy",
        "Regex": r".*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: [0-9\.]+ - val_accuracy: ([0-9\.]+).*",
    },
    {
        "Name": "validation:loss",
        "Regex": r".*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: ([0-9\.]+) - val_accuracy: [0-9\.]+.*",
    },
    {
        "Name": "sec/sample",
        "Regex": r".* - \d+s (\d+)[mu]s/sample - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: [0-9\.]+ - val_accuracy: [0-9\.]+",
    },
]
import uuid
checkpoint_s3_prefix = "checkpoints/{}".format(str(uuid.uuid4()))
checkpoint_s3_uri = "s3://{}/{}/".format(bucket, checkpoint_s3_prefix)
from sagemaker.tensorflow import TensorFlow
# --------------------------------------------------------------------------------
# 'trainingJobName' must satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}
# --------------------------------------------------------------------------------
base_job_name = "fashion-mnist"
hyperparameters = {
    "epochs": 2,
    "batch-size": 64
}
estimator = TensorFlow(
    entry_point="fashion_mnist.py",
    source_dir="src",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    input_mode='File',
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    # Do not pass an S3 model_dir to the script; the script saves to
    # SM_MODEL_DIR (/opt/ml/model) instead.
    model_dir=False
)
estimator.fit()
fashion_mnist.py
import os
import argparse
import json
import multiprocessing
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras import backend as K
print("TensorFlow version: {}".format(tf.__version__))
print("Eager execution is: {}".format(tf.executing_eagerly()))
print("Keras version: {}".format(tf.keras.__version__))
image_width = 28
image_height = 28
def load_data():
    fashion_mnist = tf.keras.datasets.fashion_mnist
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

    number_of_classes = len(set(y_train))
    print("number_of_classes", number_of_classes)

    x_train = x_train / 255.0
    x_test = x_test / 255.0
    x_full = np.concatenate((x_train, x_test), axis=0)
    print(x_full.shape)
    print(type(x_train))
    print(x_train.shape)
    print(x_train.dtype)
    print(y_train.shape)
    print(y_train.dtype)

    # Reshape data based on channels first / channels last strategy.
    # This is dependent on whether you use TF, Theano or CNTK as backend.
    # Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, image_width, image_height)
        x_test = x_test.reshape(x_test.shape[0], 1, image_width, image_height)
        input_shape = (1, image_width, image_height)
    else:
        x_train = x_train.reshape(x_train.shape[0], image_width, image_height, 1)
        x_test = x_test.reshape(x_test.shape[0], image_width, image_height, 1)
        input_shape = (image_width, image_height, 1)

    return x_train, y_train, x_test, y_test, input_shape, number_of_classes
# tensorboard --logdir=/full_path_to_your_logs
validation_split = 0.2
verbosity = 1
use_multiprocessing = True
workers = multiprocessing.cpu_count()
def train(model, x, y, args):
    # SavedModel output location under /opt/ml/model
    tensorflow_saved_model_path = os.path.join(args.model_dir, "tensorflow/saved_model/0")
    os.makedirs(tensorflow_saved_model_path, exist_ok=True)

    # Tensorboard logs
    tensorboard_logs_path = os.path.join(args.model_dir, "tensorboard/")
    os.makedirs(tensorboard_logs_path, exist_ok=True)

    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=tensorboard_logs_path,
        write_graph=True,
        write_images=True,
        histogram_freq=1,     # How often to log histogram visualizations
        embeddings_freq=1,    # How often to log embedding visualizations
        update_freq="epoch",  # How often to write logs (default: once per epoch)
    )

    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=['accuracy']
    )
    history = model.fit(
        x,
        y,
        shuffle=True,
        batch_size=args.batch_size,
        epochs=args.epochs,
        validation_split=validation_split,
        use_multiprocessing=use_multiprocessing,
        workers=workers,
        verbose=verbosity,
        callbacks=[
            tensorboard_callback
        ]
    )
    return history
# Model architecture:
# * C: Convolution layer
# * P: Pooling layer
# * B: Batch normalization layer
# * F: Fully connected layer
# * O: Output fully connected softmax layer
def create_model(input_shape, number_of_classes):
    model = Sequential([
        Conv2D(
            name="conv01",
            filters=32,
            kernel_size=(3, 3),
            strides=(1, 1),
            padding="same",
            activation='relu',
            input_shape=input_shape
        ),
        MaxPooling2D(
            name="pool01",
            pool_size=(2, 2)
        ),
        Flatten(),  # 3D shape to 1D.
        BatchNormalization(
            name="batch_before_full01"
        ),
        Dense(
            name="full01",
            units=300,
            activation="relu"
        ),  # Fully connected layer
        Dense(
            name="output_softmax",
            units=number_of_classes,
            activation="softmax"
        )
    ])
    return model
def save_model(model, args):
    # Save the model.
    # A version number sub-directory is needed for the serving container
    # to load the model.
    version = "00000000"
    model_save_dir = os.path.join(args.model_dir, version)
    if not os.path.exists(model_save_dir):
        os.makedirs(model_save_dir)

    print(f"saving model at {model_save_dir}")
    model.save(model_save_dir)
def parse_args():
    # --------------------------------------------------------------------------------
    # https://docs.python.org/dev/library/argparse.html#dest
    # --------------------------------------------------------------------------------
    parser = argparse.ArgumentParser()

    # --------------------------------------------------------------------------------
    # The Estimator hyperparameters argument is passed to the script
    # as command-line arguments.
    # --------------------------------------------------------------------------------
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)

    # /opt/ml/model
    # sagemaker.tensorflow.estimator.TensorFlow overrides 'model_dir'.
    # See https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/\
    # sagemaker.tensorflow.html#sagemaker.tensorflow.estimator.TensorFlow
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    # /opt/ml/output
    parser.add_argument("--output_dir", type=str, default=os.environ["SM_OUTPUT_DIR"])

    args = parser.parse_args()
    return args
if __name__ == "__main__":
    args = parse_args()

    print("---------- key/value args")
    for key, value in vars(args).items():
        print(f"{key}:{value}")

    x_train, y_train, x_test, y_test, input_shape, number_of_classes = load_data()
    model = create_model(input_shape, number_of_classes)

    history = train(model=model, x=x_train, y=y_train, args=args)
    print(history)

    save_model(model, args)

    results = model.evaluate(x_test, y_test, batch_size=100)
    print("test loss, test accuracy:", results)
SageMaker console
Notebook output
2021-09-03 03:02:04 Starting - Starting the training job...
2021-09-03 03:02:16 Starting - Launching requested ML instancesProfilerReport-1630638122: InProgress
......
2021-09-03 03:03:17 Starting - Preparing the instances for training.........
2021-09-03 03:04:59 Downloading - Downloading input data
2021-09-03 03:04:59 Training - Downloading the training image...
2021-09-03 03:05:23 Training - Training image download completed. Training in progress.2021-09-03 03:05:23.966037: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-09-03 03:05:23.969704: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-09-03 03:05:24.118054: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-09-03 03:05:26,842 sagemaker-training-toolkit INFO Imported framework sagemaker_tensorflow_container.training
2021-09-03 03:05:26,852 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:27,734 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt:
/usr/local/bin/python3.7 -m pip install -r requirements.txt
WARNING: You are using pip version 21.0.1; however, version 21.2.4 is available.
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.
2021-09-03 03:05:29,028 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,045 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,062 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,072 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {},
"current_host": "algo-1",
"framework_module": "sagemaker_tensorflow_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"batch-size": 64,
"epochs": 2
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "fashion-mnist-2021-09-03-03-02-02-305",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz",
"module_name": "fashion_mnist",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "fashion_mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch-size":64,"epochs":2}
SM_USER_ENTRY_POINT=fashion_mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=fashion_mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch-size":64,"epochs":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"fashion-mnist-2021-09-03-03-02-02-305","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz","module_name":"fashion_mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"fashion_mnist.py"}
SM_USER_ARGS=["--batch-size","64","--epochs","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH-SIZE=64
SM_HP_EPOCHS=2
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages
Invoking script with the following command:
/usr/local/bin/python3.7 fashion_mnist.py --batch-size 64 --epochs 2
TensorFlow version: 2.3.1
Eager execution is: True
Keras version: 2.4.0
---------- key/value args
epochs:2
batch_size:64
model_dir:/opt/ml/model
output_dir:/opt/ml/output