Use images in s3 with SageMaker without .lst files

Question

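I'm trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.

Images are stored in an s3 bucket, with the class label currently in the file name, e.g.

my-s3-bucket-dir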

cat-1.jpg
dog-1.jpg
cat-2.jpg
..

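I've been trying to adapt several related example .py scripts, but most of them either download datasets that are already in .rec format or rely on special manifest or annotation files that I don't have.

All I want is to pass the images from s3 to the SageMaker image classification algorithm, which sits in the same region, IAM account, etc. I suppose this means I need a .lst file.

When I tried to create the .lst manually, it didn't seem to like it; doing this by hand also takes too long and isn't good practice.

How can I generate the .lst file automatically (or otherwise send the images/classes for training)?

From what I've read, im2rec.py sounds like a solution, but I don't know how to apply it. The example I'm working from now is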

Image-classification-fulltraining-highlevel.ipynb

but it seems to download the data as .rec,

download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')

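so it simply skips handling the .jpeg files. I found another example that converts them to .rec, but it basically already has the .lst as a .json and just converts that.

I've mostly been working in a Python Jupyter notebook in the AWS console (in my browser), but I've also tried their GUI.

How can I simply generate the .lst automatically, without creating it by hand, or otherwise get the data/class info into SageMaker?

Update

It looks like im2rec.py can't run against s3. You have to download everything from all of your s3 buckets into the notebook's storage...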

Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team

Answer 1

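There are 3 options for providing annotated data to the image classification algorithm: (1) pack the labels into recordIO files, (2) store the labels in a JSON manifest file (the "augmented manifest" option), (3) store the labels in list files. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.

The augmented manifest and .lst file options are quick to set up, since they only require you to create an annotation file, usually with a short for loop. RecordIO requires you to use the im2rec.py tool, which is a little more work.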
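For the augmented manifest option, a minimal sketch of that loop could look like the following. It assumes one JSON object per line with a "source-ref" key pointing at the image and a label attribute that I have called "class" here; the attribute name is a choice, and it has to match whatever attribute names the training job is configured with (see the documentation linked above). The helper names are hypothetical.

```python
import json


def manifest_line(s3_uri, class_id):
    """Return one JSON Lines record for an image and its numeric class.

    The "class" attribute name is an assumption; it must match the attribute
    names configured on the training job.
    """
    return json.dumps({"source-ref": s3_uri, "class": str(class_id)})


def build_manifest(bucket, labelled_keys):
    """Build manifest lines from (object key, class id) pairs in one bucket."""
    return [
        manifest_line(f"s3://{bucket}/{key}", class_id)
        for key, class_id in labelled_keys
    ]


if __name__ == "__main__":
    # Bucket name and keys taken from the question's example layout.
    pairs = [("cat-1.jpg", 0), ("dog-1.jpg", 1), ("cat-2.jpg", 0)]
    with open("train.manifest", "w") as f:
        f.write("\n".join(build_manifest("my-s3-bucket-dir", pairs)) + "\n")
```

The resulting file is then uploaded to s3 and referenced by the training channel instead of a .lst or .rec file.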

Using .lst files is reasonably easy: you just need to write one annotation row per image with a quick loop, like this:

# Assumes train_index, train_class and train_pics hold each picture's index, class and path.
# Each row is tab-separated: index, class, path.

with open('train.lst', 'w') as file:  # 'w' so that re-running does not append duplicate rows
    for index, cl, pic in zip(train_index, train_class, train_pics):
        file.write(f'{index}\t{cl}\t{pic}\n')
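Since the labels in the question live in the file names (cat-1.jpg, dog-1.jpg, ...), the three lists can be derived from the object keys alone. Below is a rough sketch under a few assumptions: the class is everything before the last hyphen in the file name, classes are numbered alphabetically from 0, and the paths written to the .lst are the keys as listed (you may need to strip a prefix so they are relative to the location your train channel points at). Listing keys with boto3 only reads object metadata, so the images themselves do not have to be downloaded. All function names are hypothetical.

```python
import os


def label_from_key(key):
    """Return the class name encoded in a key, e.g. 'dir/cat-1.jpg' -> 'cat'."""
    filename = os.path.basename(key)
    return filename.rsplit("-", 1)[0]


def build_lst_lines(keys, extensions=(".jpg", ".jpeg", ".png")):
    """Build tab-separated .lst rows (index, numeric class, path) from keys."""
    image_keys = sorted(k for k in keys if k.lower().endswith(extensions))
    class_names = sorted({label_from_key(k) for k in image_keys})
    class_ids = {name: i for i, name in enumerate(class_names)}
    lines = [
        f"{index}\t{class_ids[label_from_key(key)]}\t{key}"
        for index, key in enumerate(image_keys)
    ]
    return lines, class_ids


def list_bucket_keys(bucket, prefix=""):
    """List object keys under a prefix without downloading any object."""
    import boto3  # imported here so the helpers above work without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def write_lst(lines, path="train.lst"):
    """Write the rows to a .lst file, one row per line."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

In a notebook this would be used roughly as `lines, class_ids = build_lst_lines(list_bucket_keys("my-s3-bucket-dir"))` followed by `write_lst(lines)`, after which the .lst file is uploaded to s3 for the training job. Keep `class_ids` around, since it is the only record of which number maps to which class.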

