How do I deploy a pre-trained sklearn model on AWS SageMaker? (Endpoint stuck on Creating)

First off, I know this question has been asked many times, but I haven't found anything that fixes my problem.

So, first of all I saved a locally trained sklearn RandomForest with joblib.dump. I then uploaded it to S3, created a folder called code, and put an inference script in it, named inference.py:

import joblib
import json
import numpy
import scipy
import sklearn
import os

"""
Deserialize fitted model
"""
def model_fn(model_dir):
    model_path = os.path.join(model_dir, 'test_custom_model')
    model = joblib.load(model_path)
    return model

"""
input_fn
    request_body: The body of the request sent to the model.
    request_content_type: (string) specifies the format/variable type of the request
"""
def input_fn(request_body, request_content_type):
    if request_content_type == 'application/json':
        request_body = json.loads(request_body)
        inpVar = request_body['Input']
        return inpVar
    else:
        raise ValueError("This model only supports application/json input")

"""
predict_fn
    input_data: returned array from input_fn above
    model (sklearn model) returned model loaded from model_fn above
"""
def predict_fn(input_data, model):
    return model.predict(input_data)

"""
output_fn
    prediction: the returned value from predict_fn above
    content_type: the content type the endpoint expects to be returned. Ex: JSON, string
"""

def output_fn(prediction, content_type):
    res = int(prediction[0])
    respJSON = {'Output': res}
    return respJSON

Pretty straightforward so far.

I also have this locally in my SageMaker Jupyter session:

all_files (folder)
    code (folder)
        inference.py (Python file)
    test_custom_model (joblib dump of the model)

The script turns this all_files folder into a tar.gz file.

Then the main script I run on SageMaker:

import boto3
import json
import os
import joblib
import pickle
import tarfile
import sagemaker
import time
from time import gmtime, strftime
import subprocess
from sagemaker import get_execution_role

#Setup
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")
boto_session = boto3.session.Session()
s3 = boto_session.resource('s3')
region = boto_session.region_name
print(region)
sagemaker_session = sagemaker.Session()
role = get_execution_role()

#Bucket for model artifacts
default_bucket = 'pretrained-model-deploy'
model_artifacts = f"s3://{default_bucket}/test_custom_model.tar.gz"

#Build tar file with model data + inference code
bashCommand = "tar -cvpzf test_custom_model.tar.gz all_files"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

#Upload tar.gz to bucket
response = s3.meta.client.upload_file('test_custom_model.tar.gz', default_bucket, 'test_custom_model.tar.gz')

# retrieve sklearn image
image_uri = sagemaker.image_uris.retrieve(
    framework="sklearn",
    region=region,
    version="0.23-1",
    py_version="py3",
    instance_type="ml.m5.xlarge",
)

#Step 1: Model Creation
model_name = "sklearn-test" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)
create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "ModelDataUrl": model_artifacts,
        }
    ],
    ExecutionRoleArn=role,
)
print("Model Arn: " + create_model_response["ModelArn"])

#Step 2: EPC Creation - Serverless
sklearn_epc_name = "sklearn-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
response = client.create_endpoint_config(
   EndpointConfigName=sklearn_epc_name,
   ProductionVariants=[
        {
            "ModelName": model_name,
            "VariantName": "sklearnvariant",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 20
            }
        } 
    ]
)

# #Step 2: EPC Creation - Synchronous
# sklearn_epc_name = "sklearn-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# endpoint_config_response = client.create_endpoint_config(
#     EndpointConfigName=sklearn_epc_name,
#     ProductionVariants=[
#         {
#             "VariantName": "sklearnvariant",
#             "ModelName": model_name,
#             "InstanceType": "ml.m5.xlarge",
#             "InitialInstanceCount": 1
#         },
#     ],
# )
# print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

#Step 3: EP Creation
endpoint_name = "sklearn-local-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=sklearn_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])


#Monitor creation
describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response)
    time.sleep(15)
print(describe_endpoint_response)

Now, I mainly just want the serverless deployment, but it fails after a while with this error message:

{'EndpointName': 'sklearn-local-ep2022-04-29-12-16-10', 'EndpointArn': 'arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-12-16-10', 'EndpointConfigName': 'sklearn-epc2022-04-29-12-16-03', 'EndpointStatus': 'Creating', 'CreationTime': datetime.datetime(2022, 4, 29, 12, 16, 10, 290000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 4, 29, 12, 16, 11, 52000, tzinfo=tzlocal()), 'ResponseMetadata': {'RequestId': '1d25120e-ddb1-474d-9c5f-025c6be24383', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1d25120e-ddb1-474d-9c5f-025c6be24383', 'content-type': 'application/x-amz-json-1.1', 'content-length': '305', 'date': 'Fri, 29 Apr 2022 12:21:59 GMT'}, 'RetryAttempts': 0}}
{'EndpointName': 'sklearn-local-ep2022-04-29-12-16-10', 'EndpointArn': 'arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-12-16-10', 'EndpointConfigName': 'sklearn-epc2022-04-29-12-16-03', 'EndpointStatus': 'Failed', 'FailureReason': 'Unable to successfully stand up your model within the allotted 180 second timeout. Please ensure that downloading your model artifacts, starting your model container and passing the ping health checks can be completed within 180 seconds.', 'CreationTime': datetime.datetime(2022, 4, 29, 12, 16, 10, 290000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 4, 29, 12, 22, 2, 68000, tzinfo=tzlocal()), 'ResponseMetadata': {'RequestId': '59fb8ddd-9d45-41f5-9383-236a2baffb73', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '59fb8ddd-9d45-41f5-9383-236a2baffb73', 'content-type': 'application/x-amz-json-1.1', 'content-length': '559', 'date': 'Fri, 29 Apr 2022 12:22:15 GMT'}, 'RetryAttempts': 0}}

The real-time deployment just stays stuck on Creating forever.

CloudWatch gives the following error:

Error handling request /ping

AttributeError: 'NoneType' object has no attribute 'startswith'

With the traceback:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/gunicorn/workers/base_async.py", line 55, in handle
    self.handle_request(listener_name, req, client, addr)

Copy-pasting has stopped working, so I attached a picture of it.

This is the error message I get:

Endpoint Arn: arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-13-18-09
{'EndpointName': 'sklearn-local-ep2022-04-29-13-18-09', 'EndpointArn': 'arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-13-18-09', 'EndpointConfigName': 'sklearn-epc2022-04-29-13-18-07', 'EndpointStatus': 'Creating', 'CreationTime': datetime.datetime(2022, 4, 29, 13, 18, 9, 548000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 4, 29, 13, 18, 13, 119000, tzinfo=tzlocal()), 'ResponseMetadata': {'RequestId': 'ef0e49ee-618e-45de-9c49-d796206404a4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'ef0e49ee-618e-45de-9c49-d796206404a4', 'content-type': 'application/x-amz-json-1.1', 'content-length': '306', 'date': 'Fri, 29 Apr 2022 13:18:24 GMT'}, 'RetryAttempts': 0}}

These are the permissions I have attached to the role:

AmazonSageMaker-ExecutionPolicy
SecretsManagerReadWrite
AmazonS3FullAccess
AmazonSageMakerFullAccess
EC2InstanceProfileForImageBuilderECRContainerBuilds
AWSAppRunnerServicePolicyForECRAccess

What am I doing wrong? I have tried different folder structures for the zip file and different accounts, all to no avail. I would really rather not use the model.deploy() method, because I don't know how to make it serverless, and it's also not consistent across model types (I'm trying to build a flexible deployment pipeline where different models (xgb / sklearn) can be deployed with minimal changes).

Please help, my hair and my laptop are both about to break; I've been struggling with this for 4 whole days.

Please follow this guide: https://github.com/RamVegiraju/Pre-Trained-Sklearn-SageMaker. During model creation, I think your inference script is not being specified in the environment variables.
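Roughly, that could look like the following in your create_model call. This is only a sketch: it assumes the script is reachable from the artifact you already upload, and the values (script name, S3 location) are taken from the question and may need adjusting for your setup.

# Sketch only: point the sklearn serving container at the inference script
# via environment variables on the container definition.
create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "ModelDataUrl": model_artifacts,
            "Environment": {
                # name of the entry-point script the container should run
                "SAGEMAKER_PROGRAM": "inference.py",
                # S3 location of the tar.gz that contains that script
                # (here, the same artifact as the model - an assumption)
                "SAGEMAKER_SUBMIT_DIRECTORY": model_artifacts,
                "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                "SAGEMAKER_REGION": region,
            },
        }
    ],
    ExecutionRoleArn=role,
)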

I think the problem is with the compressed file. From the question I understand that you are compressing all the files, including the model dump and the script.

I would suggest removing the inference script from the model artifact.

The model.tar.gz file should contain only the model.

And, as @ram-vegiraju suggested, add the environment variables pointing to the inference script.

The script should be available locally.
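For illustration, a minimal sketch of that packaging, reusing the file names from the question; the exact layout is an assumption and depends on how you then point the container at the script.

# Sketch: the archive holds only the model dump; inference.py stays next to
# the notebook and is supplied separately (entry point / environment variables).
import tarfile

with tarfile.open("test_custom_model.tar.gz", "w:gz") as tar:
    tar.add("all_files/test_custom_model", arcname="test_custom_model")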

I have solved the problem: I loaded the model data I already had with sagemaker.model.Model and called the deploy method on that model object to deploy it. I also put the inference script and the model file in the same place as the notebook and referenced them directly, since that had also been throwing errors for me earlier.
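For anyone landing here later, a minimal sketch of that kind of SDK-based deployment, using SKLearnModel (a framework wrapper around sagemaker.model.Model) together with a ServerlessInferenceConfig; the version, paths and sizes below are assumptions taken from the question, not a verified setup.

# Sketch of the SDK-based deployment described above (values are assumptions).
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.serverless import ServerlessInferenceConfig

sklearn_model = SKLearnModel(
    model_data=model_artifacts,        # s3://.../test_custom_model.tar.gz
    role=role,
    entry_point="inference.py",        # script sitting next to the notebook
    framework_version="0.23-1",
    sagemaker_session=sagemaker_session,
)

# Deploy as a serverless endpoint instead of a provisioned instance.
predictor = sklearn_model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=20,
    ),
)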