SLURM `srun` vs `sbatch` and their parameters

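I am trying to understand the difference between SLURM's `srun` and `sbatch` commands. I would be happy with a general explanation rather than specific answers to the following questions, but here are some specific points of confusion that can serve as a starting point and give an idea of what I'm looking for.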

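According to the documentation, `srun` is for submitting jobs, and `sbatch` is for submitting jobs for later execution, but the practical difference is unclear to me, and their behavior seems to be the same. For example, I have a cluster with 2 nodes, each with 2 CPUs. If I execute `srun testjob.sh &` 5 times in a row, it will nicely queue up the fifth job until a CPU becomes available, as will executing `sbatch testjob.sh`.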

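To make the question more concrete, I think a good place to start might be: what are some things that I can do with one that I cannot do with the other, and why?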

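Many of the arguments to both commands are the same. The ones that seem the most relevant are `--ntasks`, `--nodes`, `--cpus-per-task` and `--ntasks-per-node`. How are these related to each other, and how do they differ for `srun` vs `sbatch`?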

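One particular difference is that `srun` will cause an error if `testjob.sh` does not have executable permission (i.e. `chmod +x testjob.sh` has not been run), whereas `sbatch` will happily run it. What is happening "under the hood" that causes this to be the case?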

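The documentation also mentions that `srun` is commonly used inside of `sbatch` scripts. This leads to the question: how do they interact with each other, and what is the "canonical" use-case for each of them? Specifically, would I ever use `srun` by itself?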

The documentation says

srun is used to submit a job for execution in real time

sbatch is used to submit a job script for later execution.

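Both accept practically the same set of parameters. The main difference is that `srun` is interactive and blocking (you get the result in your terminal and you cannot type other commands until it has finished), while `sbatch` is batch processing and non-blocking (results are written to a file and you can submit other commands right away).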

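If you run `srun` in the background with the `&` sign, you remove the 'blocking' feature of `srun`, which becomes interactive but non-blocking. It is still interactive, though, meaning that the output will clutter your terminal and the `srun` processes are linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (basically depending on whether they use stdout or not). They will also be killed if the machine you connect to in order to submit jobs is rebooted.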

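If you use `sbatch`, you submit your job and it is handled by Slurm; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process.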
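
As a rough illustration of the blocking versus non-blocking behaviour described above, the following sketch runs the same script both ways. The `slurm-demo` directory and the contents of `testjob.sh` are made up for the example, and it assumes the current directory is visible from the compute nodes. The Slurm commands are skipped if they are not installed.

```shell
#!/bin/bash
# Hypothetical stand-in for the testjob.sh from the question.
mkdir -p slurm-demo
cat > slurm-demo/testjob.sh <<'EOF'
#!/bin/bash
echo "running on $(hostname)"
EOF
chmod +x slurm-demo/testjob.sh

if command -v srun >/dev/null 2>&1 && command -v sbatch >/dev/null 2>&1; then
    # Interactive and blocking: output is printed here and the prompt
    # only returns once the job has finished.
    srun --ntasks=1 slurm-demo/testjob.sh

    # Batch and non-blocking: returns right after submission; by default the
    # output is written to slurm-<jobid>.out in the submission directory.
    sbatch --ntasks=1 slurm-demo/testjob.sh
else
    echo "Slurm commands not found; nothing submitted" >&2
fi
```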

What are some things that I can do with one that I cannot do with the other, and why?

A feature that is available to `sbatch` and not to `srun` is job arrays. As `srun` can be used within an `sbatch` script, there is nothing that you cannot do with `sbatch`.
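
A job array can loosely reproduce the "run `testjob.sh` five times" experiment from the question with a single submission. This is only a sketch: the file name `array_job.sh`, the index range and the output file pattern are choices made for the example.

```shell
#!/bin/bash
mkdir -p slurm-demo
cat > slurm-demo/array_job.sh <<'EOF'
#!/bin/bash
# Five array tasks with indices 0..4, one task each.
#SBATCH --array=0-4
#SBATCH --ntasks=1
# %A expands to the array job ID, %a to the array index.
#SBATCH --output=array_%A_%a.out

# Each array task runs this script once with its own index.
echo "array task ${SLURM_ARRAY_TASK_ID}"
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch slurm-demo/array_job.sh
else
    echo "sbatch not found; script written but not submitted" >&2
fi
```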

How are these related to each other, and how do they differ for srun vs sbatch?

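The parameters `--ntasks`, `--nodes`, `--cpus-per-task` and `--ntasks-per-node` all have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of `--exclusive`.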
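
One common way these options combine, sketched for the 2-node, 2-CPU-per-node cluster from the question: when `--ntasks` is not given, the task count is typically `--nodes` times `--ntasks-per-node`, and each task is given `--cpus-per-task` CPUs. The variable names below are illustrative only, and the actual placement depends on how the cluster is configured.

```shell
#!/bin/bash
nodes=2
ntasks_per_node=2
cpus_per_task=1

# Tasks across the allocation, and CPUs requested in total.
ntasks=$(( nodes * ntasks_per_node ))
total_cpus=$(( ntasks * cpus_per_task ))
echo "tasks: ${ntasks}, CPUs requested: ${total_cpus}"
# prints "tasks: 4, CPUs requested: 4"

if command -v srun >/dev/null 2>&1; then
    # The same options could also be given to sbatch, either on the
    # command line or as #SBATCH lines in the script.
    srun --nodes="$nodes" --ntasks-per-node="$ntasks_per_node" \
         --cpus-per-task="$cpus_per_task" hostname
else
    echo "srun not found; nothing launched" >&2
fi
```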

What is happening "under the hood" that causes this to be the case?

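`srun` immediately executes the script on the remote host, while `sbatch` copies the script into internal storage and then uploads it to the compute node when the job starts. You can check this by modifying your submission script after it has been submitted; the changes will not be taken into account (see this).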
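
The permission difference can be reproduced with a small experiment. The sketch below follows the explanation above (`srun` executes the file directly, `sbatch` reads and stores a copy of it). The file name `noexec.sh` is made up, and the sketch assumes the current directory is visible from the compute nodes.

```shell
#!/bin/bash
mkdir -p slurm-demo
cat > slurm-demo/noexec.sh <<'EOF'
#!/bin/bash
hostname
EOF
# Deliberately leave the script without execute permission.
chmod a-x slurm-demo/noexec.sh

if command -v srun >/dev/null 2>&1 && command -v sbatch >/dev/null 2>&1; then
    # srun has to execute the file itself, so this is expected to fail.
    srun --ntasks=1 slurm-demo/noexec.sh || echo "srun failed (no execute permission)"

    # Here the executable is bash; the script only needs to be readable.
    srun --ntasks=1 bash slurm-demo/noexec.sh

    # sbatch only reads the script at submission time, so the missing
    # execute bit does not matter.
    sbatch --ntasks=1 slurm-demo/noexec.sh
else
    echo "Slurm commands not found; nothing submitted" >&2
fi
```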

How do they interact with each other, and what is the "canonical" use-case for each of them?

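You typically use `sbatch` to submit a job and `srun` in the submission script to create what Slurm calls job steps. `srun` is used to launch the processes. If your program is a parallel MPI program, `srun` takes care of creating all the MPI processes. If not, `srun` will run your program as many times as specified by the `--ntasks` option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, `srun` inherits by default the pertinent options of the `sbatch` or `salloc` which it runs under (from here).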
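
A minimal sketch of this pattern, with `sbatch` providing the allocation and each `srun` line inside the script creating one job step. The script name `steps_job.sh` and the resource numbers are assumptions chosen to match the small cluster from the question.

```shell
#!/bin/bash
mkdir -p slurm-demo
cat > slurm-demo/steps_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

# First job step: srun inherits the allocation's options, so hostname
# is launched once per task (4 times here).
srun hostname

# Second job step: explicit options override the inherited ones and
# launch a single task on one node.
srun --nodes=1 --ntasks=1 hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch slurm-demo/steps_job.sh
else
    echo "sbatch not found; script written but not submitted" >&2
fi
```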

Specifically, would I ever use srun by itself?

Other than for small tests, no. A common use is `srun --pty bash` to get a shell within a job on a compute node.

This doesn't actually fully answer the question, but here is some more information I found that may be helpful for someone in the future:


From a related thread I found with a similar question:

In a nutshell, sbatch and salloc allocate resources to the job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.

srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.

There's a relatively new web page which goes into more detail regarding the -B and --exclusive options.

doc/html/cpu_management.shtml


Additional information from the SLURM FAQ page:

The srun command has two different modes of operation. First, if not run within an existing job (i.e. not within a Slurm job allocation created by salloc or sbatch), then it will create a job allocation and spawn an application. If run within an existing allocation, the srun command only spawns the application. For this question, we will only address the first mode of operation and compare creating a job allocation using the sbatch and srun commands.

The srun command is designed for interactive use, with someone monitoring the output. The output of the application is seen as output of the srun command, typically at the user's terminal. The sbatch command is designed to submit a script for later execution and its output is written to a file. Command options used in the job allocation are almost identical. The most noticeable difference in options is that the sbatch command supports the concept of job arrays, while srun does not. Another significant difference is in fault tolerance. Failures involving sbatch jobs typically result in the job being requeued and executed again, while failures involving srun typically result in an error message being generated with the expectation that the user will respond in an appropriate fashion.


Another related conversation here