Sagemaker batch transform job failure for 'batchStrategy: MultiRecord' along with data processing

We are using a SageMaker Batch Transform job and want to fit as many records as possible into each mini-batch within the MaxPayloadInMB limit, so we set BatchStrategy to MultiRecord and SplitType to Line.

The input to the SageMaker batch transform job is:

{"requestBody": {"data": {"Age": 90, "Experience": 26, "Income": 30, "Family": 3, "CCAvg": 1}}, "mName": "loanprediction", "mVersion": "1", "testFlag": "false", "environment": "DEV", "transactionId": "5-687sdf87-0bc7e3cb3454dbf261ed1353", "timestamp": "2022-01-15T01:45:32.955Z"}
{"requestBody": {"data": {"Age": 55, "Experience": 26, "Income": 450, "Family": 3, "CCAvg": 1}}, "mName": "loanprediction", "mVersion": "1", "testFlag": "false", "environment": "DEV", "transactionId": "5-69e22778-594916685f4ceca66c08bfbc", "timestamp": "2022-01-15T01:46:32.386Z"}

Here is the SageMaker batch transform job configuration:

apiVersion: sagemaker.aws.amazon.com/v1
kind: BatchTransformJob
metadata:
        generateName: '...-batchtransform'
spec:
        batchStrategy: MultiRecord
        dataProcessing:
                joinSource: Input
                outputFilter: $
                inputFilter: $.requestBody
        modelClientConfig:
                invocationsMaxRetries: 0
                invocationsTimeoutInSeconds: 3
        mName: '..'
        region: us-west-2
        transformInput:
                contentType: application/json
                dataSource:
                        s3DataSource:
                                s3DataType: S3Prefix
                                s3Uri: s3://....../part-
                splitType: Line
        transformOutput:
                accept: application/json
                assembleWith: Line
                kmsKeyId: '....'
                s3OutputPath: s3://..../batch_output
        transformResources:
                instanceCount: ..
                instanceType: '..'

The SageMaker batch transform job fails with:

Error from the batch transform data log -

2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt:

2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt:

400 Bad Request 2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt:

Failed to decode JSON object: Extra data: line 2 column 1 (char 163)

Observation: The issue occurs when we set batchStrategy: MultiRecord in the manifest together with this data processing configuration:

dataProcessing:
        joinSource: Input
        outputFilter: $
        inputFilter: $.requestBody

Note: If we use batchStrategy: SingleRecord with the same data processing configuration, it works fine (the job succeeds)!

Question: How can we get the job to run successfully with batchStrategy: MultiRecord and the above data processing configuration?

The successful output with batchStrategy: SingleRecord looks like this:

{"SageMakerOutput":{"prediction":0},"environment":"DEV","transactionId":"5-687sdf87-0bc7e3cb3454dbf261ed1353","mName":"loanprediction","mVersion":"1","requestBody":{"data":{"Age":90,"CCAvg":1,"Experience":26,"Family":3,"Income":30}},"testFlag":"false","timestamp":"2022-01-15T01:45:32.955Z"}
{"SageMakerOutput":{"prediction":0},"environment":"DEV","transactionId":"5-69e22778-594916685f4ceca66c08bfbc","mName":"loanprediction","mVersion":"1","requestBody":{"data":{"Age":55,"CCAvg":1,"Experience":26,"Family":3,"Income":450}},"testFlag":"false","timestamp":"2022-01-15T01:46:32.386Z"}

Relevant resource ARN: arn:aws:sagemaker:us-west-2:435945521637:transform-job/my-pipeline-9v28r-bat-e548fbfb125946528957e0f123456789

With MultiRecord you cannot use dataProcessing to join the input with the predictions. You have to merge the inputs and predictions manually after the batch job finishes. The MultiRecord strategy batches records together up to the maximum size specified in MaxPayloadInMB.

For input like:

input1 

input2

input3

input4

input5

input6

the output is formatted like this:

output1,output2,output3

output4,output5,output6

You need to process the output file and merge it with the input data to get the desired result. Since your output format is JSON, I would expect the predictions to come back as a JSON array; you can explode the array and merge it with the input, preserving the order of the predictions as you do so. You can find more details at:

https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.html#Batch-prediction-on-new-data
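The manual merge described above can be sketched as follows. This is a hypothetical example, not part of the original answer: it assumes one input record per line and that each output line holds a JSON array of predictions for one mini-batch, in input order (file names are placeholders).

```python
import json

def merge_predictions(input_path, output_path, merged_path):
    """Join batch-transform predictions back onto the input records by
    line position, since order is preserved within a file."""
    with open(input_path) as f:
        records = [json.loads(line) for line in f if line.strip()]

    predictions = []
    with open(output_path) as f:
        for line in f:
            if line.strip():
                # each output line holds the predictions for one mini-batch
                predictions.extend(json.loads(line))

    # sanity check: every input record must have exactly one prediction
    assert len(records) == len(predictions), "record/prediction count mismatch"

    with open(merged_path, "w") as f:
        for rec, pred in zip(records, predictions):
            # mimic the joined shape SingleRecord + joinSource: Input produces
            rec["SageMakerOutput"] = pred
            f.write(json.dumps(rec) + "\n")
```

The positional zip only works because batch transform keeps record order within each input file, which is why the answer stresses preserving order when exploding the array.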

When your input data is in JSON Lines format and you choose the SingleRecord BatchStrategy, your container receives a single JSON payload body like this:

{ <some JSON data> }

However, if you use MultiRecord, batch transform splits your JSON Lines input (say, 100 lines) into multiple records (say, 10 records) and sends them to your container together in one request, as shown below:

{ <some JSON data> }
{ <some JSON data> }
{ <some JSON data> }
{ <some JSON data> }
.
.
.
{ <some JSON data> }

So your container must be able to handle this kind of input to work correctly. From the error message, I can see that it complains about invalid JSON while reading the second line of the request.

I also noticed that you provided the ContentType and AcceptType as application/json, but they should be application/jsonlines.

Could you please test your container and check whether it can handle multiple JSON Lines records in a single invocation?
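To illustrate what the container test above should cover, here is a minimal sketch of the parsing step inside an invocations handler. It is an assumption about your container's internals, and `model_predict` is a placeholder for the real inference call: the point is that calling `json.loads()` on the whole MultiRecord body raises exactly the "Extra data: line 2 column 1" error from your logs, so the body must be parsed line by line.

```python
import json

def model_predict(record):
    # placeholder for the real model; returns a dummy score
    return 0

def parse_invocation_body(body: str):
    """Parse a request body that may contain several newline-delimited
    JSON records, as batch transform sends with splitType: Line and
    batchStrategy: MultiRecord."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records

def predict_batch(body: str) -> str:
    # Score every record and emit one JSON line per input, preserving
    # order so the results can be joined back to the inputs later.
    records = parse_invocation_body(body)
    results = [{"prediction": model_predict(r)} for r in records]
    return "\n".join(json.dumps(r) for r in results)
```

A container whose handler is written this way accepts both SingleRecord bodies (one line) and MultiRecord bodies (many lines) without changes.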