How to solve an error when deploying a model in AWS SageMaker?
I have to deploy a custom Keras model in AWS SageMaker. I created a notebook instance, and I have the following files:
AmazonSagemaker-Codeset16
-ann
-nginx.conf
-predictor.py
-serve
-train.py
-wsgi.py
-Dockerfile
I then open the AWS terminal, build the Docker image, and push the image to an ECR repository. Next I open a new Jupyter Python notebook and try to fit the model and deploy it. Training completes correctly, but on deployment I get the following error:
"Error hosting endpoint sagemaker-example-2019-10-25-06-11-22-366: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..."
When I check the logs, I find the following:
2019/11/11 11:53:32 [crit] 19#19: *3 connect() to unix:/tmp/gunicorn.sock failed (2: No such file or directory) while connecting to upstream, client: 10.32.0.4, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "model.aws.local:8080"
and
Traceback (most recent call last):
  File "/usr/local/bin/serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/cli/serve.py", line 19, in main
    server.start(env.ServingEnv().framework_module)
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/_server.py", line 107, in start
    module_app,
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
I tried deploying the same model to AWS SageMaker from my local machine using these same files, and the model deployed successfully; but from inside AWS I run into this problem.
Here is the code of my serve file:
from __future__ import print_function
import multiprocessing
import os
import signal
import subprocess
import sys

cpu_count = multiprocessing.cpu_count()

model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))

def sigterm_handler(nginx_pid, gunicorn_pid):
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass
    sys.exit(0)

def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/ml/code/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')

# The main routine just invokes the start function.
if __name__ == '__main__':
    start_server()
I deploy the model with:
predictor = classifier.deploy(1, 'ml.t2.medium', serializer=csv_serializer)
Please let me know what mistake I am making in the deployment.
Using SageMaker Script Mode is much simpler than dealing with containers and low-level nginx plumbing the way you are trying to. Have you considered it?
You only need to provide your Keras script:
With Script Mode, you can use training scripts similar to those you would use outside SageMaker, running on SageMaker's prebuilt containers for various deep learning frameworks such as TensorFlow, PyTorch, and Apache MXNet.
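As a rough sketch of what that flow can look like with the SageMaker Python SDK's TensorFlow estimator (the script name, S3 path, framework version, and instance types below are placeholders, not values from your setup, and this needs an AWS session to actually run):

```
# Script Mode sketch: SageMaker's prebuilt TensorFlow container runs your
# own train.py, so no custom Dockerfile, nginx.conf, or serve script is needed.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # valid inside a SageMaker notebook

estimator = TensorFlow(
    entry_point='train.py',            # your existing Keras training script
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',
    framework_version='1.14',
    py_version='py3',
    script_mode=True,
)

estimator.fit('s3://my-bucket/training-data')   # placeholder S3 path

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
)
```

SageMaker then handles the serving stack (web server, health checks, scaling) for you, which sidesteps the gunicorn/nginx problem entirely.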
You should make sure your container can respond to GET /ping requests: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests
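For illustration, a minimal Flask app behind wsgi:app exposing the two routes SageMaker expects might look like the sketch below. The health-check logic here is an assumption (the `model_is_loaded` helper is hypothetical); adapt it to however your predictor.py actually loads the Keras model.

```python
# Sketch of the two routes a custom SageMaker inference container must serve:
# GET /ping (health check) and POST /invocations (predictions).
from flask import Flask, Response, request

app = Flask(__name__)

def model_is_loaded():
    # Hypothetical check: replace with whatever verifies that your
    # Keras model deserialized correctly in predictor.py.
    return True

@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker marks the container healthy only if /ping returns 200.
    status = 200 if model_is_loaded() else 404
    return Response(response='\n', status=status, mimetype='application/json')

@app.route('/invocations', methods=['POST'])
def invocations():
    data = request.data.decode('utf-8')
    # Hypothetical echo; real code would run the Keras model on the CSV input.
    return Response(response=data, status=200, mimetype='text/csv')
```

If /ping never returns 200 within the startup window, the endpoint fails with exactly the "did not pass the ping health check" error you are seeing.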
From the traceback, it looks like the server fails to start when the container is launched in SageMaker. I would look further down the stack trace to see why the server fails to start.
You can also try running your container locally to debug any issues. SageMaker starts your container with the command 'docker run <image> serve', so you can run the same command and debug your container. https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image
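For example (the image name is a placeholder), you can reproduce SageMaker's startup and health check locally:

```
# Build the image and start it the way SageMaker does ("serve" is the
# argument SageMaker passes to the container's entry point).
docker build -t my-keras-inference .
docker run --rm -p 8080:8080 my-keras-inference serve

# In a second terminal, issue the same health check SageMaker uses:
curl -i http://localhost:8080/ping
```

If gunicorn or nginx fails to start, the error appears immediately in your terminal instead of being buried in CloudWatch.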
You don't have gunicorn installed, which is why you get the error unix:/tmp/gunicorn.sock failed (2: No such file or directory). You need to add pip install gunicorn and apt-get install nginx to your Dockerfile.
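A sketch of what those Dockerfile additions can look like (the base image, package list, and COPY paths are assumptions modeled on the common bring-your-own-container layout, not your actual Dockerfile):

```
FROM ubuntu:18.04

# nginx for the front-end proxy, pip for the Python serving stack
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip nginx ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# gunicorn is what the serve script launches; if it is missing, nginx finds
# no backend socket: "unix:/tmp/gunicorn.sock failed (2: No such file or directory)"
RUN pip3 install --no-cache-dir flask gevent gunicorn

COPY ann /opt/ml/code
WORKDIR /opt/ml/code
ENV PATH="/opt/ml/code:${PATH}"
```

After rebuilding, push the image to ECR again and redeploy the endpoint.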