Sagemaker to use processed pickled ndarray instead of csv files from S3

Question

I know that you can pass CSV files from S3 into a Sagemaker XGBoost container using the following code:

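import sagemaker

# train_data and validation_data hold the S3 locations of the CSV data
train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')

# Map each channel name to its input and start training
data_channels = {'train': train_channel, 'validation': valid_channel}
xgb_model.fit(inputs=data_channels, logs=True)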

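But I have an ndarray stored in an S3 bucket. These are processed, label-encoded, feature-engineered arrays. I want to pass this into the container instead of the csv. I know I can always convert my ndarray into csv files before saving it to S3. Just checking if there is an array option.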

Answer 1

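There are multiple options for algorithms in SageMaker:

1. Built-in algorithms, like the SageMaker XGBoost you mention
2. Custom, user-created algorithm code, which can be:
   - Written for a pre-built docker image, available for Sklearn, TensorFlow, Pytorch, MXNet
   - Written in your own container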

When you use built-ins (option 1), your choice of data formats is limited to what the built-in supports, which is only csv and libsvm in the case of the built-in XGBoost. If you want to use custom data formats and pre-processing logic before XGBoost, it is absolutely possible if you use your own script leveraging the open-source XGBoost. You can get inspiration from the Random Forest demo to see how to create custom models in pre-built containers.
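To make the custom-script route concrete, here is a minimal sketch of what such a training script could contain. It is not taken from the Random Forest demo. The file name `train.pkl`, the label-in-last-column layout, the output file name `xgboost-model.json`, and the `reg:squarederror` objective are all assumptions chosen for illustration, so adjust them to match how your arrays were actually saved.

```python
import os
import pickle

import numpy as np


def load_pickled_array(channel_dir, filename="train.pkl"):
    """Load a pickled 2-D ndarray from a channel directory.

    Assumption: the file name is hypothetical and the label is stored
    in the last column; change both to match your own data.
    """
    with open(os.path.join(channel_dir, filename), "rb") as f:
        data = np.asarray(pickle.load(f))
    # Split into feature matrix and label vector
    return data[:, :-1], data[:, -1]


def train(train_dir, model_dir, num_round=10):
    """Train open-source XGBoost on the pickled array and save the model."""
    # Imported here so the loader above can be used without xgboost installed
    import xgboost as xgb

    features, labels = load_pickled_array(train_dir)
    dtrain = xgb.DMatrix(features, label=labels)
    # Placeholder objective (an assumption); pick one that fits your task
    params = {"objective": "reg:squarederror"}
    booster = xgb.train(params, dtrain, num_boost_round=num_round)
    booster.save_model(os.path.join(model_dir, "xgboost-model.json"))
    return booster
```

In a script-mode container, the entry point would typically call `train(os.environ["SM_CHANNEL_TRAIN"], os.environ["SM_MODEL_DIR"])`, since those environment variables point to the downloaded `train` channel and the model output directory. The pickle file itself is passed as a channel in `fit`, as in the question, just without the `text/csv` content type.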
