Slurm: How to restart a failed worker job
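If one is running an array job on a Slurm cluster, how can failed worker jobs be restarted?
In a Sun Grid Engine queue, one can add #$ -r y
to the job file to indicate that the job should be restarted if it fails. What is the equivalent of this flag in Slurm?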
You can use --requeue, which marks the job as eligible for requeuing:
#SBATCH --requeue ### On failure, requeue for another try
--requeue
Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
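The description above lists node failure, preemption, and administrator action as requeue triggers; it does not mention a task that simply exits with an error. One pattern that may cover that case is for the script to requeue itself with scontrol requeue. The following is a minimal sketch, not part of the original answer: the array range, MAX_RESTARTS, and the run_task placeholder are hypothetical, and whether a running job is allowed to requeue itself depends on the cluster configuration.

```shell
#!/bin/bash
#SBATCH --array=1-10      # hypothetical array range
#SBATCH --requeue         # make each task eligible for requeuing

MAX_RESTARTS=3            # hypothetical limit, not a Slurm setting

# Succeeds (returns 0) while the restart count is below the limit.
# SLURM_RESTART_COUNT is set by Slurm after a requeue; treat unset as 0.
may_requeue() {
    [ "${SLURM_RESTART_COUNT:-0}" -lt "$MAX_RESTARTS" ]
}

# Placeholder for the real per-task work.
run_task() {
    echo "running array task ${SLURM_ARRAY_TASK_ID:-0}"
}

if ! run_task; then
    # Requeue only inside a Slurm job and while under the limit.
    if [ -n "${SLURM_JOB_ID:-}" ] && may_requeue; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit 1
fi
```

Because a requeued job starts its batch script from the beginning, the work inside run_task should be safe to repeat. The restart limit is there to keep a task that always fails from cycling through the queue indefinitely.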