SageMaker batch transform job failure for 'batchStrategy: MultiRecord' along with data processing
We are using a SageMaker Batch Transform job and packing as many records as possible into each mini-batch within the MaxPayloadInMB limit: we set BatchStrategy to MultiRecord and SplitType to Line.
The input to the SageMaker batch transform job looks like this:
{"requestBody": {"data": {"Age": 90, "Experience": 26, "Income": 30, "Family": 3, "CCAvg": 1}}, "mName": "loanprediction", "mVersion": "1", "testFlag": "false", "environment": "DEV", "transactionId": "5-687sdf87-0bc7e3cb3454dbf261ed1353", "timestamp": "2022-01-15T01:45:32.955Z"}
{"requestBody": {"data": {"Age": 55, "Experience": 26, "Income": 450, "Family": 3, "CCAvg": 1}}, "mName": "loanprediction", "mVersion": "1", "testFlag": "false", "environment": "DEV", "transactionId": "5-69e22778-594916685f4ceca66c08bfbc", "timestamp": "2022-01-15T01:46:32.386Z"}
Here is the SageMaker batch transform job configuration:
apiVersion: sagemaker.aws.amazon.com/v1
kind: BatchTransformJob
metadata:
  generateName: '...-batchtransform'
spec:
  batchStrategy: MultiRecord
  dataProcessing:
    JoinSource: Input
    OutputFilter: $
    inputFilter: $.requestBody
  modelClientConfig:
    invocationsMaxRetries: 0
    invocationsTimeoutInSeconds: 3
  mName: '..'
  region: us-west-2
  transformInput:
    contentType: application/json
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3Uri: s3://....../part-
    splitType: Line
  transformOutput:
    accept: application/json
    assembleWith: Line
    kmsKeyId: '....'
    s3OutputPath: s3://..../batch_output
  transformResources:
    instanceCount: ..
    instanceType: '..'
The SageMaker batch transform job fails with:
Error from the batch transform data log -
2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt:
2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt: 400 Bad Request
2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt: Failed to decode JSON object: Extra data: line 2 column 1 (char 163)
Observations:
The issue occurs when we set batchStrategy: MultiRecord in the manifest together with this data-processing configuration:
dataProcessing:
  JoinSource: Input
  OutputFilter: $
  inputFilter: $.requestBody
Note: if we use batchStrategy: SingleRecord with the same data-processing configuration, it works fine (the job succeeds)!
Question: how can we get a successful run with batchStrategy: MultiRecord together with the above data-processing configuration?
The successful output with batchStrategy: SingleRecord looks like this:
{"SageMakerOutput":{"prediction":0},"environment":"DEV","transactionId":"5-687sdf87-0bc7e3cb3454dbf261ed1353","mName":"loanprediction","mVersion":"1","requestBody":{"data":{"Age":90,"CCAvg":1,"Experience":26,"Family":3,"Income":30}},"testFlag":"false","timestamp":"2022-01-15T01:45:32.955Z"}
{"SageMakerOutput":{"prediction":0},"environment":"DEV","transactionId":"5-69e22778-594916685f4ceca66c08bfbc","mName":"loanprediction","mVersion":"1","requestBody":{"data":{"Age":55,"CCAvg":1,"Experience":26,"Family":3,"Income":450}},"testFlag":"false","timestamp":"2022-01-15T01:46:32.386Z"}
Relevant resource ARN:
arn:aws:sagemaker:us-west-2:435945521637:transform-job/my-pipeline-9v28r-bat-e548fbfb125946528957e0f123456789
You cannot use dataProcessing to join the input with the predictions when using MultiRecord. You have to join the inputs and the predictions manually after the batch job finishes. The MultiRecord strategy batches records together up to the maximum size specified in MaxPayloadInMB.
For an input of:
input1
input2
input3
input4
input5
input6
the output is formatted like:
output1,output2,output3
output4,output5,output6
You need to process the output file and join it with the input data to get the desired result. Since your output format is JSON, I would expect the predictions to come back as a JSON array; you can explode that array and join it with the input, keeping the predictions in order as you explode them. You can find more details in the SageMaker batch transform documentation.
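A minimal sketch of that post-processing join in Python, assuming each output line holds a JSON array of predictions in input order (the file names here are hypothetical):

import json

# Hypothetical local copies of the transform input file and the
# corresponding ".out" file from s3OutputPath.
with open("part-00001.txt") as f:
    inputs = [json.loads(line) for line in f if line.strip()]

predictions = []
with open("part-00001.txt.out") as f:
    for line in f:
        if not line.strip():
            continue
        batch = json.loads(line)
        # Assumes each line is a JSON array of per-record predictions;
        # fall back to a single object for SingleRecord-style output.
        predictions.extend(batch if isinstance(batch, list) else [batch])

assert len(inputs) == len(predictions), "record counts must match"

# Re-attach each prediction to its input record, relying on record
# order being preserved within and across mini-batches.
with open("joined.jsonl", "w") as out:
    for record, pred in zip(inputs, predictions):
        record["SageMakerOutput"] = pred
        out.write(json.dumps(record) + "\n")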
When your input data is in JSON Lines format and you choose the SingleRecord BatchStrategy, your container receives a single JSON payload body, like this:
{ <some JSON data> }
However, if you use MultiRecord, batch transform splits your JSON Lines input (say it contains 100 lines) into multiple records (say, 10 records), which are all sent to your container at once, as shown below:
{ <some JSON data> }
{ <some JSON data> }
{ <some JSON data> }
{ <some JSON data> }
.
.
.
{ <some JSON data> }
So your container must be able to handle this kind of input for the job to work. From the error message, however, I can see that it complains about invalid JSON when it reads the second line of the request.
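That is exactly the failure mode you get when a body containing several JSON Lines records is parsed as one JSON document. A quick local reproduction in Python (the char offset grows with payload size, matching the char 163 in your data log):

import json

# Two JSON Lines records in one request body, as MultiRecord sends them.
body = '{"data": {"Age": 90}}\n{"data": {"Age": 55}}\n'

json.loads(body)
# json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 22)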
I also noticed that you supplied the ContentType and AcceptType as application/json, but they should be application/jsonlines.
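In your manifest that would mean changing the relevant fields to something like:

transformInput:
  contentType: application/jsonlines
  ...
transformOutput:
  accept: application/jsonlines
  ...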
Could you please test your container and check whether it can handle multiple JSON Lines records in a single invocation?
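For reference, a minimal sketch of an /invocations handler that tolerates both shapes, assuming a Flask-based serving container (predict is a placeholder for the real model call):

import json

from flask import Flask, Response, request

app = Flask(__name__)

def predict(record):
    # Placeholder: replace with the real model inference.
    return {"prediction": 0}

@app.route("/invocations", methods=["POST"])
def invocations():
    body = request.get_data(as_text=True)
    # Parse every non-empty line as its own JSON record, so both
    # SingleRecord (one line) and MultiRecord (many lines) work.
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    results = [predict(record) for record in records]
    # Return one JSON Lines result per input record, preserving order.
    return Response("\n".join(json.dumps(r) for r in results),
                    mimetype="application/jsonlines")

@app.route("/ping", methods=["GET"])
def ping():
    return Response(status=200)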