Slurm: How to restart a failed worker job

If one is running an array job on a Slurm cluster, how can failed worker jobs be restarted?

In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate that the job should be restarted if it fails. What is the equivalent of this flag in Slurm?

You can use --requeue:

#SBATCH --requeue                   ### On failure, requeue for another try

--requeue

Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.

See more here: https://slurm.schedmd.com/sbatch.html#lbAE
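Note that the description above lists only administrator action, node failure, and preemption as requeue triggers. If the goal is to retry an array task whose own command exits non-zero, one possible approach (a sketch, not something the manual excerpt prescribes) is to have the script call scontrol requeue on itself. The array range, the MAX_RESTARTS limit, and the run_task, describe_attempt, and may_retry helpers below are hypothetical names chosen for illustration; SLURM_JOB_ID, SLURM_ARRAY_TASK_ID, and SLURM_RESTART_COUNT are environment variables that Slurm sets inside the job.

```shell
#!/bin/bash
#SBATCH --array=0-9        # hypothetical array range
#SBATCH --requeue          # mark each array task as eligible for requeuing

MAX_RESTARTS=3             # hypothetical retry limit, not a Slurm setting

# Report which array task this is and how often it has been restarted.
# The :-0 defaults only exist so the function can be dry-run outside a job.
describe_attempt() {
    echo "task ${SLURM_ARRAY_TASK_ID:-0} restart ${SLURM_RESTART_COUNT:-0}"
}

# Exit status 0 while the task still has restarts left.
may_retry() {
    [ "${SLURM_RESTART_COUNT:-0}" -lt "$MAX_RESTARTS" ]
}

# Hypothetical payload; replace `true` with the real worker command.
run_task() {
    true
}

describe_attempt

if ! run_task; then
    # Only requeue when actually running under Slurm and under the limit.
    if may_retry && [ -n "${SLURM_JOB_ID:-}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit 1
fi
```

The restart limit is there so that a task that fails deterministically does not requeue itself forever. Whether a regular user may requeue their own jobs this way can depend on how the cluster is configured, so it is worth checking with the site administrators.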