Is there a way to set a walltime on AWS Batch jobs?

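Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, and it avoids wasting resources when a job hangs for whatever reason.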

As far as I know, there is no feature that does this. However, workarounds for a similar problem have been proposed in the forum:

One idea is to call Batch as an Activity from Step Functions and have the job ping back on a schedule (e.g. every minute). If it stops responding, you can detect that situation as a Timeout in the activity and act accordingly (terminate the job etc.). Not an ideal solution (especially if the job continues to ping back as a "zombie"), but it's a start. You'd also likely have to store activity tokens in a database to trace them to the Batch job id.

Alternatively, you can split that setup into 2 steps: schedule a Batch job from a Lambda in the first state, then pass the Batch job id to the second step, which polls Batch (from another Lambda) for its state with Retry and IntervalSeconds (e.g. once every minute, or even with exponential backoff), and MaxAttempts calculated based on your timeout. This way, you don't need any external state storage mechanism, long polling or even a "ping back" from the job (it CAN be a zombie), but the downside is more steps.
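A rough sketch of the polling Lambda from the second step, in Python. The `jobId` key in the event and the `JobStillRunning` exception name are assumptions made for illustration; the state machine's Retry rule would need to match on that exception name, and `max_attempts` only covers the fixed-interval case (a BackoffRate of 1.0), not exponential backoff.

```python
import math


class JobStillRunning(Exception):
    """Raised so that a Step Functions Retry rule re-invokes this poller."""


def max_attempts(timeout_seconds, interval_seconds):
    # With a fixed retry interval, this many attempts span the whole timeout.
    return math.ceil(timeout_seconds / interval_seconds)


def check_job(batch_client, job_id):
    """Return the final job status, or raise JobStillRunning to trigger a retry."""
    response = batch_client.describe_jobs(jobs=[job_id])
    status = response["jobs"][0]["status"]
    if status in ("SUCCEEDED", "FAILED"):
        return status
    raise JobStillRunning(f"{job_id} is {status}")


def handler(event, context):
    # boto3 is imported lazily so the helpers above can be tried without AWS.
    import boto3

    # Assumption: the first state passes the Batch job id as event["jobId"].
    return check_job(boto3.client("batch"), event["jobId"])
```

For a one-hour limit polled once a minute, `max_attempts(3600, 60)` gives 60. Once the retries are exhausted, a Catch on the state can route to a step that terminates the job.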

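There is no option to set a timeout on Batch jobs, but you can set up a Lambda function that triggers every hour or so and terminates jobs that were created more than, say, 24 hours ago.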
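A minimal sketch of such a cleanup function, assuming Python with boto3. The function names, the termination reason text, and the choice to sweep every non-finished status are illustrative assumptions; the Batch client is passed in so the logic can be tried against a stub.

```python
import time

# Job states that are not finished yet; terminate_job cancels jobs that
# have not started and terminates those that are starting or running.
ACTIVE_STATUSES = ("SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING")


def stale_job_ids(summaries, now_ms, max_age_seconds):
    # createdAt in a job summary is a Unix timestamp in milliseconds.
    cutoff = now_ms - max_age_seconds * 1000
    return [s["jobId"] for s in summaries if s["createdAt"] < cutoff]


def terminate_stale_jobs(batch_client, job_queue,
                         max_age_seconds=24 * 3600, now_ms=None):
    """Terminate every unfinished job in job_queue older than max_age_seconds."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    terminated = []
    for status in ACTIVE_STATUSES:
        kwargs = {"jobQueue": job_queue, "jobStatus": status}
        while True:
            page = batch_client.list_jobs(**kwargs)
            for job_id in stale_job_ids(page["jobSummaryList"],
                                        now_ms, max_age_seconds):
                batch_client.terminate_job(jobId=job_id,
                                           reason="Exceeded maximum age")
                terminated.append(job_id)
            token = page.get("nextToken")
            if not token:
                break
            kwargs["nextToken"] = token  # fetch the next page of results
    return terminated
```

In a Lambda handler this could be called as `terminate_stale_jobs(boto3.client("batch"), "my-queue")`, where `my-queue` is a placeholder queue name, with a scheduled rule invoking the function hourly.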

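I have been using AWS for a while now, but could not find a way to set a maximum running time for Batch jobs. However, there are some workarounds you can use. AWS Forum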

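Unfortunately, there is no way to set an execution time limit on AWS Batch. One solution may be to edit the Docker entrypoint so that it enforces an execution time limit.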
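One way the entrypoint idea could look is a small Python wrapper that runs the real command with a time limit. This is only a sketch: the script name `entrypoint.py`, its argument layout, and the exit code 124 (borrowed from the GNU `timeout` convention) are assumptions, and `subprocess.run` kills only the direct child process, not any grandchildren it spawned.

```python
import subprocess
import sys


def run_with_limit(command, limit_seconds):
    """Run command; return its exit code, or 124 if it ran past the limit."""
    try:
        return subprocess.run(command, timeout=limit_seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return 124


def main(argv):
    # Hypothetical layout: entrypoint.py <limit_seconds> <command> [args...]
    return run_with_limit(argv[2:], float(argv[1]))
```

The script would end with `sys.exit(main(sys.argv))`, and the image's entrypoint would be set to something like `python entrypoint.py 86400 <original command>`.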

As of April 2018, AWS Batch supports setting a job timeout, either when submitting a job or in the job definition:

https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/

You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.

Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html

POST /v1/submitjob HTTP/1.1
Content-type: application/json

{
   ...
   "timeout": { 
      "attemptDurationSeconds": number
   }
}
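
The same request could be made from Python with boto3's `submit_job`. This is a sketch: the helper name and the client-side check of the 60-second minimum are assumptions added for illustration, and the Batch client is passed in rather than created here.

```python
MIN_TIMEOUT_SECONDS = 60  # minimum attemptDurationSeconds per the docs quoted above


def submit_with_timeout(batch_client, job_name, job_queue,
                        job_definition, timeout_seconds):
    """Submit a Batch job that is terminated after timeout_seconds per attempt."""
    if timeout_seconds < MIN_TIMEOUT_SECONDS:
        raise ValueError(
            f"timeout must be at least {MIN_TIMEOUT_SECONDS} seconds")
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        timeout={"attemptDurationSeconds": timeout_seconds},
    )
    return response["jobId"]
```

For example, `submit_with_timeout(boto3.client("batch"), "nightly", "my-queue", "my-job-def", 3600)` would cap each attempt at one hour; the job, queue, and definition names here are placeholders.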