slurm: DependencyNeverSatisfied error even after crashed job re-queued

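My goal is to build a pipeline using slurm dependencies and to handle the case where a slurm job crashes.

Based on the answer quoted below and section 29 of the guide, the recommendation is to use scontrol requeue $jobID, which re-queues a job that has already been cancelled.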

if job crashes can be detected from within the submission script, and crashes are random, you can simply requeue the job with scontrol requeue $SLURM_JOB_ID so that it runs again.
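
The quoted advice could be sketched roughly as follows inside a batch script. This is only an illustration: run_or_requeue is a hypothetical helper name (not part of Slurm), and it assumes the cluster allows the job to be re-queued and that crashes really are random, since a deterministic failure would re-queue forever.

```shell
#!/bin/bash
#SBATCH --requeue   # assumption: the cluster permits re-queueing this job

# run_or_requeue is a hypothetical helper, not a Slurm command.
# It runs the given command and, if that command exits non-zero,
# asks Slurm to re-queue the current job.
run_or_requeue() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with status $status; re-queueing job $SLURM_JOB_ID" >&2
        scontrol requeue "$SLURM_JOB_ID"
    fi
    return "$status"
}

# Example call inside the batch script (the program name is a placeholder):
#   run_or_requeue ./my_simulation
```

SLURM_JOB_ID is set by Slurm inside a running job, so the helper only makes sense when called from the batch script itself.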


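After I re-queue a cancelled job, its dependent job remains in the DependencyNeverSatisfied state, and nothing happens even after the re-queued job completes. Is there any way to update the dependent job's state when a cancelled job is re-queued?

Example: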

$ sbatch run.sh
Submitted batch job 1
$ sbatch  --dependency=aftercorr:1 run.sh
$ squeue
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            89     debug   run.sh    alper PD       0:00      1 (Dependency)
            88     debug   run.sh    alper  R       0:23      1 ebloc1

$ scancel 1
$ squeue
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            89     debug   run.sh    alper PD       0:00      1 (DependencyNeverSatisfied)

$ scontrol requeue 1
$ squeue
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            89     debug   run.sh    alper PD       0:00      1 (DependencyNeverSatisfied)
            88     debug   run.sh    alper  R       0:00      1 ebloc1
# After the running job completes, the dependent job still remains in the DependencyNeverSatisfied state:
$ squeue
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            89     debug   run.sh    alper PD       0:00      1 (DependencyNeverSatisfied)

Yes, and it's quite simple. Use scontrol to reset the dependency:

scontrol update jobid=[dependent job id] dependency=after:[requeued job id]

Here is an example, using Slurm version 17.11:
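
If you need to do this often, the command can be wrapped in a small function. The sketch below is an assumption-laden convenience wrapper: reset_dependency and the DRY_RUN variable are hypothetical names of my own, and the only Slurm call it makes is the scontrol update shown above.

```shell
# reset_dependency is a hypothetical helper, not a Slurm command.
# Usage: reset_dependency <dependent_job_id> <requeued_job_id> [dependency_type]
# The dependency type defaults to "after", as in the command above.
reset_dependency() {
    local dependent="$1"
    local requeued="$2"
    local dep_type="${3:-after}"
    if [ -z "$dependent" ] || [ -z "$requeued" ]; then
        echo "usage: reset_dependency <dependent_job_id> <requeued_job_id> [type]" >&2
        return 2
    fi
    # When DRY_RUN is set to a non-empty value, print the command instead of running it.
    ${DRY_RUN:+echo} scontrol update "jobid=$dependent" "dependency=$dep_type:$requeued"
}

# Example calls:
#   DRY_RUN=1 reset_dependency 540913 540912
#   reset_dependency 540913 540912 afterok
```

Passing the type as a third argument lets you restore the dependency type the job was originally submitted with instead of the default.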

$ sbatch --begin=now+60 --wrap="exit 1"                   
Submitted batch job 540912

$ sbatch --dependency=afterok:540912 --wrap=hostname 
Submitted batch job 540913

$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall PD       0:00      1 (BeginTime)
        540913     debug     wrap marshall PD       0:00      1 (Dependency)
$ scancel 540912
$ scontrol requeue 540912
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall PD       0:00      1 (BeginTime)
        540913     debug     wrap marshall PD       0:00      1 (DependencyNeverSatisfied)

At this point, I have reproduced your situation. Job 540912 has been re-queued, and job 540913 has the reason "DependencyNeverSatisfied".
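
To find jobs stuck in this state without reading the squeue table by eye, something like the following filter might help. It is a sketch: list_never_satisfied is a hypothetical helper name, and it assumes squeue is asked for just the job ID and reason columns via the %i and %r format specifiers.

```shell
# list_never_satisfied is a hypothetical helper, not a Slurm command.
# It reads "jobid reason" pairs on stdin and prints the IDs of the jobs
# whose reason is exactly DependencyNeverSatisfied.
list_never_satisfied() {
    awk '$2 == "DependencyNeverSatisfied" { print $1 }'
}

# Typical use on a cluster (header suppressed, your own jobs only):
#   squeue -h -u "$USER" -o "%i %r" | list_never_satisfied
```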

Now you can fix it by issuing scontrol update on the dependent job:

$ scontrol update jobid=540913 dependency=after:540912
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall PD       0:00      1 (BeginTime)
        540913     debug     wrap marshall PD       0:00      1 (Dependency)

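The state is fixed! Note that an after dependency is satisfied once job 540912 starts (or is cancelled), not when it succeeds; use afterok or aftercorr in the update instead if you need the original semantics. Once the job runs, the dependent job runs too: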

$ scontrol update jobid=540912 starttime=now
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall CG       0:00      1 v1
        540913     debug     wrap marshall PD       0:00      1 (Dependency)
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

The squeue output is empty because both jobs have finished.

You can inspect the jobs with sacct after they complete:

$ sacct -j 540912,540913
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
540912             wrap      debug       test          2     FAILED      1:0 
540912.batch      batch                  test          2     FAILED      1:0 
540912.exte+     extern                  test          2  COMPLETED      0:0 
540913             wrap      debug       test          2  COMPLETED      0:0 
540913.batch      batch                  test          2  COMPLETED      0:0 
540913.exte+     extern                  test          2  COMPLETED      0:0
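
In a pipeline you may want to check these final states from a script rather than by reading the table. A minimal sketch, assuming sacct is run with pipe-separated output restricted to the JobID, State and ExitCode fields; jobs_not_completed is a hypothetical helper name:

```shell
# jobs_not_completed is a hypothetical helper, not a Slurm command.
# It reads "JobID|State|ExitCode" lines on stdin and prints the job ID and
# exit code of every job whose state is not COMPLETED. Note that this also
# matches jobs that are still pending or running.
jobs_not_completed() {
    awk -F'|' '$2 != "COMPLETED" { print $1, $3 }'
}

# Typical use (no header, one line per job allocation, pipe-separated):
#   sacct -n -X -P -j 540912,540913 -o JobID,State,ExitCode | jobs_not_completed
```

With the two jobs above, only 540912 would be reported, since it failed with exit code 1:0 while 540913 completed.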