How can I verify that my training job is reading the augmented manifest file?

Question

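Sorry for the long post.

Initially, I had my data in a single location in an S3 bucket and trained a deep learning image classification model on it using the typical 'File' mode, passing the S3 URI where the data is stored as the training input. To try to speed up training, I wanted to switch to: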

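1. Pipe mode, to stream the data instead of downloading all of it at the start of training, so that training starts faster and uses less disk space.
2. An augmented manifest file combined with 1., so that my data does not have to sit in a single location on S3 and I can avoid moving data around when I train models.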
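For reference, the combination of the two points above can be sketched as one entry of the `InputDataConfig` list that the low-level `CreateTrainingJob` API accepts. This is only a sketch: the helper name `augmented_manifest_channel`, the bucket and the manifest path are hypothetical, and the attribute names are the ones used in the manifest shown in this post.

```python
def augmented_manifest_channel(manifest_uri, attribute_names, channel_name="train"):
    """Build one InputDataConfig entry for an augmented manifest read in Pipe mode."""
    if not manifest_uri.startswith("s3://"):
        raise ValueError("manifest_uri must be an s3:// URI")
    return {
        "ChannelName": channel_name,
        "InputMode": "Pipe",  # stream the data instead of downloading it first
        "ContentType": "application/x-recordio",
        "RecordWrapperType": "RecordIO",  # each attribute is wrapped as a RecordIO record
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_uri,
                "S3DataDistributionType": "FullyReplicated",
                # Attributes to stream from each JSON line, in this order.
                "AttributeNames": list(attribute_names),
            }
        },
    }


# Hypothetical bucket and key, for illustration only.
channel = augmented_manifest_channel(
    "s3://my-bucket/manifests/train.manifest", ["image-ref", "label"]
)
print(channel["DataSource"]["S3DataSource"]["S3DataType"])
```

Printing the resulting dictionary (or the equivalent section of the training job description in the console) is a quick way to confirm that the job was really configured with the manifest and not with a plain S3 prefix.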

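I am writing a script similar to the one in this example. I printed the steps performed when parsing the data, but I noticed that the data might not be getting read, because the printout shows the following: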

step 1 Tensor("ParseSingleExample/ParseExample/ParseExampleV2:0", shape=(), dtype=string)
step 2 Tensor("DecodePng:0", shape=(None, None, 3), dtype=uint8)
step 3 Tensor("Cast:0", shape=(None, None, 3), dtype=float32)

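My guess is that the images are not being read/found, since the shape is [None, None, 3] when it should be [224, 224, 3], so maybe the problem comes from the augmented manifest file?

Here is an example of how my augmented manifest file is written: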

{"image-ref": "s3://path/to/my/image/image1.png", "label": 1}
{"image-ref": "s3://path/to/my/image/image2.png", "label": 2}
{"image-ref": "s3://path/to/my/image/image3.png", "label": 3}

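Some other details I should probably mention:

1. When I create the training input, I pass 'content_type': 'application/x-recordio', 'record_wrapping': 'RecordIO', even though my data is in .png format; I assumed that when the augmented manifest file is read, the data gets wrapped in the RecordIO format.
2. Following from my first point, I pass PipeModeDataset(channel=channel, record_format='RecordIO'), so I am not sure about the RecordIO part either.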
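One way to see what a RecordIO-wrapped stream would contain is to frame and unframe a few records by hand. The sketch below assumes the MXNet-style RecordIO layout (a 4-byte magic number 0xCED7230A, a 4-byte length field whose lower 29 bits hold the payload length, and zero padding to a 4-byte boundary). It also assumes that the channel delivers one record per manifest attribute in the listed order, so image bytes and labels alternate. The helper names `write_record`, `read_records` and `pair_up` are hypothetical, and the payloads are placeholder bytes, not real PNG data.

```python
import io
import struct

RECORDIO_MAGIC = 0xCED7230A  # assumed magic number of the RecordIO framing


def write_record(stream, payload):
    """Write one record: magic, length, payload, then zero padding to 4 bytes."""
    stream.write(struct.pack("<I", RECORDIO_MAGIC))
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)
    stream.write(b"\x00" * (-len(payload) % 4))


def read_records(stream):
    """Yield record payloads; multi-part records (continuation flags) are not handled."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated record header")
        magic, length_field = struct.unpack("<II", header)
        if magic != RECORDIO_MAGIC:
            raise ValueError("bad magic number, stream is not RecordIO framed")
        length = length_field & ((1 << 29) - 1)  # lower 29 bits hold the length
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        stream.read(-length % 4)  # skip the padding
        yield payload


def pair_up(records, group_size=2):
    """Group consecutive records, e.g. (image bytes, label) for two attributes."""
    records = list(records)
    return [tuple(records[i:i + group_size]) for i in range(0, len(records), group_size)]


# Simulate a stream for two manifest lines with two attributes each.
buffer = io.BytesIO()
for payload in [b"fake-image-bytes-1", b"1", b"fake-image-bytes-2", b"2"]:
    write_record(buffer, payload)
buffer.seek(0)
pairs = pair_up(read_records(buffer))
print(pairs)
```

If this assumption about the stream holds, each record is a raw attribute value, not a serialized tf.train.Example, and consecutive records have to be grouped before an image and its label can be handled together.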

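No actual error is raised. When I start fitting the model, nothing happens: the job keeps running but nothing is actually executed, so I am trying to find the problem.

Edit: It now reads the shape correctly, but there is still the problem that it enters the .fit method and does nothing; it just keeps running without doing anything. Part of the script is below.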

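# Imports are not shown in the original excerpt; the ones below are assumed.
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset
from tensorflow.keras.layers import Dense, Input

AUTOTUNE = tf.data.experimental.AUTOTUNE

def train_input_fn(train_channel):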
    """Returns input function that feeds the model during training"""
    return _input_fn(train_channel)

def _input_fn(channel):
    """
        Returns a Dataset which reads from a SageMaker PipeMode channel.
    """
    
    features = {
        'image-ref': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([3], tf.int64),
    }
 
    def combine(records):
        return records[0], records[1]
 
    def parse(record):
        
        parsed = tf.io.parse_single_example(record, features)
        
                 

        image = tf.io.decode_png(parsed["image-ref"], channels=3, dtype=tf.uint8)
        image = tf.reshape(image, [224, 224, 3])
        
        lbl = parsed['label']
        print(image, lbl)
        return (image, lbl)
 
    ds = PipeModeDataset(channel=channel, record_format='RecordIO')
    ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
    ds = ds.prefetch(AUTOTUNE)
 
    return ds

def model(dataset):
    """Generate a simple model"""
    inputs = Input(shape=(224, 224, 3))
    prediction_layer = Dense(2, activation = 'softmax')


    x = inputs
    x = tf.keras.applications.mobilenet.MobileNet(include_top=False, input_shape=(224,224,3), weights='imagenet')(x)
    outputs = prediction_layer(x)
    rec_model = tf.keras.Model(inputs, outputs)    
    
    rec_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )
    
    
    rec_model.fit(
        dataset
    )

    return rec_model

def main(params):
    
    epochs = params['epochs']
    train_channel = params['train_channel']
    record_format = params['record_format']
    batch_size = params['batch_size']
        
    train_spec = train_input_fn(train_channel)
    model_classifier = model(train_spec)

Answer 1

From here:

A PipeModeDataset can read TFRecord, RecordIO, or text line records.

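You, on the other hand, are trying to read binary (PNG) files. I don't see a relevant record reader here that would help you do that.
You could build your own pipe implementation for your format, as shown here, but that takes considerably more effort.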

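Alternatively, you mentioned that your files are scattered across different folders, but if the common path of your files contains fewer than 2M files, you could use FastFile mode to stream the data. Currently, FastFile only supports an S3 prefix, so you won't be able to use a manifest.

Also see this general pros/cons discussion of the different available storage and input types available in SageMaker.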
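As a rough illustration of the FastFile alternative, the channel can be described with an S3 prefix instead of a manifest. This is a sketch of one `InputDataConfig` entry for the low-level `CreateTrainingJob` API; the helper name `fastfile_channel` and the bucket and prefix are hypothetical.

```python
def fastfile_channel(s3_prefix, channel_name="train"):
    """Build one InputDataConfig entry that streams an S3 prefix in FastFile mode."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must be an s3:// URI")
    return {
        "ChannelName": channel_name,
        "InputMode": "FastFile",  # files are streamed on demand but look like local files
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",  # a common prefix, not a manifest
                "S3Uri": s3_prefix,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical common prefix that contains all training images.
fast_channel = fastfile_channel("s3://my-bucket/images/")
print(fast_channel["InputMode"], fast_channel["DataSource"]["S3DataSource"]["S3Uri"])
```

With this mode the training script reads the PNG files from the channel directory as it would in File mode, so no RecordIO wrapping or PipeModeDataset is involved.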

python

manifest

amazon-s3

tensorflow

amazon-sagemaker