Python process killed - docker, aws and conda issue

Question

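I have a Python script that runs fine locally on Cloud9. I am migrating it to Fargate (CodeBuild). I ran into the usual issues and resolved the missing Python module errors and similar problems simply by adding the modules to the environment. Those seem to be fixed, but now I get a very strange kill message from Python. This is the error message from the logs: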

ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', 'server.py']' command failed.  (See above for error)

/opt/conda/envs/myenv/.tmpjsztda5o: line 3:    22 Killed                  python server.py

Strangely, the Fargate logs show these errors changing slightly on each run (it spawns a new attempt roughly every 2-3 minutes). For example:

ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', 'server.py']' command failed.  (See above for error)
/opt/conda/envs/myenv/.tmp3hrkajas: line 3:    21 Killed                  python server.py

Here is the Dockerfile.

FROM public.ecr.aws/lts/ubuntu:latest
RUN echo Updating existing packages, installing and upgrading python and pip.

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -y
RUN apt-get install build-essential -y
RUN apt-get install g++ -y
RUN apt-get install gcc -y
RUN apt-get install gdal-bin -y
RUN which gcc 
RUN echo $PATH

FROM osgeo/gdal:ubuntu-small-latest
FROM continuumio/miniconda3
WORKDIR /app
## Create the environment:
COPY environment.yml .
RUN conda env create -f environment.yml
#Make RUN commands use the new environment:
RUN echo "conda activate myenv" >> ~/.bashrc
SHELL ["conda", "run", "-n", "myenv", "/bin/bash", "-c"]

# Demonstrate the environment is activated:
RUN echo "Make sure flask is installed:"
RUN python -c "import flask"


RUN echo Copy service directory

COPY ./PHREEQC /PHREEQC
COPY ./service /service
COPY ./temp_files /temp_files
COPY ./INPUT_DATA /INPUT_DATA

WORKDIR /service

ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "myenv", "python","server.py"]

If you have any ideas on how to start debugging this, please let me know!

Answer

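Thanks to itamar, this was indeed a memory issue. I solved it by increasing the size of the container, that is, by reconfiguring the ECS task to a somewhat larger size.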

Deregister the existing task definition ->

aws ecs deregister-task-definition --task-definition ecs_name:ID

Register a new task definition with the updated size.

Size -> changed from cpu 256 and memory 512 to ->

 "family": "fam_name",
 "cpu": "1024",
 "memory": "4096",
 "networkMode": "awsvpc",
 "requiresCompatibilities": [
   "FARGATE"....

Then

aws ecs register-task-definition --cli-input-json file://~/environment/aws-cli/task-definition.json

Update the service

aws ecs update-service --cluster name-Cluster --service name-Service --task-definition ecs_name:ID
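The resize step above can be sketched in Python as a small helper that edits the task definition before it is registered. This is only an illustration: the `resize_task_definition` function and the `FARGATE_MEMORY_OPTIONS` table are hypothetical names, the table reflects the CPU/memory pairings as I understand the Fargate documentation (larger sizes may exist, so check the current limits), and the sample dictionary contains only the fields visible in the fragment above, not a complete task definition.

```python
import copy
import json

# Assumption: Fargate CPU units -> allowed memory values in MiB, as I
# understand the AWS documentation. Verify against the current docs.
FARGATE_MEMORY_OPTIONS = {
    "256": [512, 1024, 2048],
    "512": list(range(1024, 4096 + 1, 1024)),
    "1024": list(range(2048, 8192 + 1, 1024)),
    "2048": list(range(4096, 16384 + 1, 1024)),
    "4096": list(range(8192, 30720 + 1, 1024)),
}


def resize_task_definition(task_def: dict, cpu: str, memory: str) -> dict:
    """Return a copy of task_def with new task-level cpu and memory values."""
    allowed = FARGATE_MEMORY_OPTIONS.get(cpu)
    if allowed is None or int(memory) not in allowed:
        raise ValueError(f"cpu={cpu} with memory={memory} is not in the table")
    resized = copy.deepcopy(task_def)  # leave the caller's dict untouched
    resized["cpu"] = cpu
    resized["memory"] = memory
    return resized


# Partial task definition: only the fields shown in the answer.
old = {
    "family": "fam_name",
    "cpu": "256",
    "memory": "512",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
}
new = resize_task_definition(old, "1024", "4096")
print(json.dumps(new, indent=1))
```

The printed JSON could then be saved to the file that is passed to aws ecs register-task-definition.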

Hope this helps!!

Answer 1

If your process is being killed (as opposed to crashing), the most likely problem is that it ran out of memory.
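To make the distinction between "killed" and "crashed" concrete, here is a minimal sketch that inspects a child process's return code. On POSIX systems, Python's subprocess module reports a negative return code -N when the child was terminated by signal N, and shells report the same event as exit status 128 + N (137 for SIGKILL). The `describe_exit` helper is a hypothetical name, and the two child commands only simulate a crash and a kill; a SIGKILL by itself does not prove the cause was memory.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Classify a return code as reported by subprocess.run on POSIX."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        sig = signal.Signals(-returncode)
        if sig == signal.SIGKILL:
            return "killed by SIGKILL (possibly an out-of-memory kill)"
        return f"terminated by {sig.name}"
    return f"crashed or exited with status {returncode}"


# Simulated crash: an uncaught exception makes Python exit with status 1.
crashed = subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('boom')"],
    stderr=subprocess.DEVNULL,
)
# Simulated kill: the child sends SIGKILL to itself.
killed = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

print(describe_exit(crashed.returncode))
print(describe_exit(killed.returncode))
```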

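Linux has heuristics that try to detect applications using too much memory and then kills them with kill -9 (macOS has a similar system).
Containers also have memory limits, and your process is killed when it hits them.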
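One way to see which limit applies inside a running container is to read the cgroup memory files. The sketch below tries the usual cgroup v2 path and then the cgroup v1 path. The helper names are hypothetical, and whether these files reflect the task-level limit on a given platform (Fargate included) is an assumption to verify in your own environment.

```python
from pathlib import Path
from typing import Optional

# Usual locations of the memory limit inside a Linux container.
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup limit value; cgroup v2 writes 'max' for no limit."""
    raw = raw.strip()
    if not raw or raw == "max":
        return None
    # Note: cgroup v1 reports "no limit" as a very large integer instead.
    return int(raw)


def container_memory_limit(paths=CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the limit in bytes from the first readable file, else None."""
    for path in paths:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable: try the next candidate
    return None


limit = container_memory_limit()
if limit is None:
    print("no cgroup memory limit found")
else:
    print(f"cgroup memory limit: {limit / 1024 / 1024:.0f} MiB")
```

Logging this value at startup makes it easier to compare the configured task memory with what the process actually sees.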

For other symptoms, see https://pythonspeed.com/articles/python-out-of-memory/.

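First suggestion: configure more memory for your container. For example, if you are using Fargate, you can give your task more memory, up to 30GB. If 30GB is not enough, you need to move the workload somewhere else or reduce its memory usage.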

To understand memory usage, you can use https://pythonspeed.com/fil/. You can find suggestions on reducing memory usage at https://pythonspeed.com/memory/.
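As a rough first check before reaching for a dedicated profiler, the standard library's tracemalloc module can report peak memory allocated through Python's allocators. This is a different technique from the Fil profiler linked above, and it does not see memory allocated directly by native libraries. The `peak_python_memory` wrapper and the `build_table` workload are hypothetical stand-ins for whatever server.py actually does.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_table(n):
    # Hypothetical workload: n separate 1 KiB byte strings.
    return [bytes(1024) for _ in range(n)]


rows, peak = peak_python_memory(build_table, 10_000)
print(f"peak traced memory: {peak / 1024 / 1024:.1f} MiB")
```

Comparing this peak with the task's memory setting gives a first hint of whether 512 MiB was ever realistic for the workload.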
