SageMaker with TensorFlow 2 not saving model

Question

I am working with Keras and trying to train a model with SageMaker. My problem is the following: when I train my model with TensorFlow 1.12, everything works fine:

estimator = TensorFlow(entry_point='entrypoint-2.py',
                       base_job_name='mlearning-test',
                       role=role,
                       train_instance_count=1,
                       input_mode='Pipe',
                       train_instance_type='ml.p2.xlarge',
                       framework_version='1.12.0')

My model is trained and saved to S3. No problems.

However, if I change the framework version to 2.0.0

estimator = TensorFlow(entry_point='entrypoint-2.py',
                       base_job_name='mlearning-test',
                       role=role,
                       train_instance_count=1,
                       input_mode='Pipe',
                       train_instance_type='ml.p2.xlarge',
                       framework_version='2.0.0')

I get the following error:

2020-02-12 13:54:36,601 sagemaker_tensorflow_container.training WARNING  No model artifact is saved under path /opt/ml/model. Your training job will not save any model files to S3.
For details of how to construct your training script see:
https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script

The training job is marked as successful, but the S3 bucket is empty, and indeed no training took place.

As an alternative, I tried setting py_version='py3', but the same thing keeps happening. Is there a major difference I am not aware of when using TF2 on SageMaker?

I don't think the entry point is needed, since it works fine with version 1.12, but in case you are curious or can spot something in it, here it is:

import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset
#from tensorflow.contrib.data import map_and_batch

INPUT_TENSOR_NAME = 'inputs_input'  
BATCH_SIZE = 64
NUM_CLASSES = 5
BUFFER_SIZE = 50
PREFETCH_SIZE = 1
LENGTH = 512
SEED = 26
EPOCHS = 1
WIDTH = 512

def keras_model_fn(hyperparameters):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(WIDTH, 'relu', input_shape=(None, WIDTH), name = 'inputs'),
        #tf.keras.layers.InputLayer(input_shape=(None, WIDTH), name=INPUT_TENSOR_NAME),
        tf.keras.layers.Dense(256, 'relu'),
        tf.keras.layers.Dense(128, 'relu'),
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
    ])

    opt = tf.keras.optimizers.RMSprop()

    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=["accuracy"])
    return model

def serving_input_fn(hyperparameters):
    # Notice that the input placeholder has the same input shape as the Keras model input
    tensor = tf.placeholder(tf.float32, shape=[None, WIDTH])

    # The inputs key INPUT_TENSOR_NAME matches the Keras InputLayer name
    inputs = {INPUT_TENSOR_NAME: tensor}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

def train_input_fn(training_dir, params):
    """Returns input function that would feed the model during training"""
    return _input_fn('train')

def eval_input_fn(training_dir, params):
    """Returns input function that would feed the model during evaluation"""
    return _input_fn('eval')

def _input_fn(channel):
    """Returns a Dataset for reading from a SageMaker PipeMode channel."""
    print("DATA "+channel)
    features={
        'question': tf.FixedLenFeature([WIDTH], tf.float32),
        'label': tf.FixedLenFeature([1], tf.int64)
    }

    def parse(record):
        parsed = tf.parse_single_example(record, features)
        #print("-------->"+str(tf.cast(parsed['question'], tf.float32))
        return {
            INPUT_TENSOR_NAME: tf.cast(parsed['question'], tf.float32)
        }, parsed['label']

    ds = PipeModeDataset(channel)
    if EPOCHS > 1:
        ds = ds.repeat(EPOCHS)
    ds = ds.prefetch(PREFETCH_SIZE)
    #ds = ds.apply(map_and_batch(parse, batch_size=BATCH_SIZE,
    #                            num_parallel_batches=BUFFER_SIZE))
    ds = ds.map(parse, num_parallel_calls=BUFFER_SIZE)
    ds = ds.batch(BATCH_SIZE)

    return ds

Answer 1

You're right: last year there was a major, beneficial change in the SageMaker TensorFlow experience, called the Script Mode formalism. As you can see in the SDK documentation:

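"Warning. We have added a new format of your TensorFlow training script with TensorFlow version 1.11. This new way gives the user script more flexibility. This new format is called Script Mode, as opposed to Legacy Mode, which is what we support with TensorFlow 1.11 and older versions. In addition we are adding Python 3 support with Script Mode. The last supported version of Legacy Mode will be TensorFlow 1.12. Script Mode is available with TensorFlow version 1.11 and newer. Make sure you refer to the correct version of this README when you prepare your script. You can find the Legacy Mode README here."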

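With TensorFlow 2, you need to follow that Script Mode formalism and save your model under the /opt/ml/model path yourself; otherwise nothing will be sent to S3. In other words, functions such as keras_model_fn and train_input_fn from Legacy Mode are no longer called for you: the entry point runs as an ordinary Python script. Script Mode is quite straightforward to implement and gives better flexibility and portability. The same spec is shared by the SageMaker Sklearn, SageMaker PyTorch and SageMaker MXNet containers, so it is definitely worth adopting.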
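As a rough illustration of what a Script Mode entry point could look like for the model in the question, here is a minimal sketch, not a drop-in replacement. It relies on the SM_MODEL_DIR environment variable that the SageMaker training container sets (normally /opt/ml/model) and on hyperparameters arriving as command-line flags. Everything else is an assumption made for the example: the flag names --epochs, --batch-size and --sm-model-dir, the helper names parse_args and export_path, the version subdirectory "1", the simplified layer stack, and the switch to sparse_categorical_crossentropy because the labels are integers. The PipeModeDataset call is copied from the question; whether its default record format matches your data is something to verify.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments SageMaker passes to a Script Mode entry point."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags; these two are illustrative.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=64)
    # SM_MODEL_DIR is set inside the training container (normally /opt/ml/model).
    parser.add_argument("--sm-model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Ignore flags this sketch does not declare (for example --model_dir).
    args, _unknown = parser.parse_known_args(argv)
    return args


def export_path(model_dir, version="1"):
    """Directory the model is written to; the version folder is an assumption."""
    return os.path.join(model_dir, version)


def main(argv=None):
    args = parse_args(argv)

    # Imported lazily so parse_args() can be exercised without TensorFlow installed.
    import tensorflow as tf
    from sagemaker_tensorflow import PipeModeDataset

    width, num_classes = 512, 5
    features = {
        "question": tf.io.FixedLenFeature([width], tf.float32),
        "label": tf.io.FixedLenFeature([1], tf.int64),
    }

    def parse(record):
        parsed = tf.io.parse_single_example(record, features)
        return parsed["question"], parsed["label"]

    # Same call as in the question; check which record_format your channel needs.
    train_ds = PipeModeDataset("train").map(parse).batch(args.batch_size)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(width,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer labels, hence the sparse loss (a change from the question's code).
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="rmsprop", metrics=["accuracy"])
    model.fit(train_ds, epochs=args.epochs)

    # The step Legacy Mode used to do for you: write the model under SM_MODEL_DIR.
    model.save(export_path(args.sm_model_dir))


# Only start training inside a SageMaker container, where SM_MODEL_DIR is set.
if __name__ == "__main__" and "SM_MODEL_DIR" in os.environ:
    main()
```

The key line is the final model.save call: SageMaker uploads whatever is in that directory when the job ends, and the warning in the question appears when the directory is empty. The estimator definition from the question should be able to stay the same apart from pointing entry_point at the new script and setting py_version='py3'.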

keras

tensorflow

amazon-sagemaker