How to immediately submit all Snakemake jobs to a SLURM cluster

Question

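I'm using snakemake to build a variant calling pipeline that can run on a SLURM cluster. The cluster has login nodes and compute nodes. Any real computing should be done on a compute node as an srun or sbatch job. Jobs are limited to 48 hours of runtime. My problem is that with many samples, especially when the queue is busy, it takes more than 48 hours to run all the rules for every sample. The traditional cluster execution mode of snakemake leaves a master thread running that only submits a rule to the queue after all of that rule's dependencies have finished running. I'm supposed to run this master program on a compute node, so this limits the runtime of my entire pipeline to 48 hours.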

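I know SLURM jobs have dependency directives that tell a job to wait until other jobs have finished. Because a snakemake workflow is a DAG, is it possible to submit all the jobs at once, with each job's dependencies defined by the rule dependencies in the DAG? Once all jobs are submitted the master thread would finish, which sidesteps the 48 hour limit. Is this possible with snakemake, and if so, how does it work? I've found the --immediate-submit command line option, but I'm not sure whether it has the behavior I'm looking for or how to use it, because my cluster prints Submitted batch job [id] after a job is submitted to the queue instead of just the job ID.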

Answer 1

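Unfortunately, immediate submit does not work out of the box; it needs some tuning. This is because the way dependencies between jobs are passed differs between cluster systems. A while ago I struggled with the same problem. As the immediate-submit documentation says: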

Immediately submit all jobs to the cluster instead of waiting for present input files. This will fail, unless you make the cluster aware of job dependencies, e.g. via: $ snakemake --cluster 'sbatch --dependency {dependencies}'. Assuming that your submit script (here sbatch) outputs the generated job id to the first stdout line, {dependencies} will be filled with space separated job ids this job depends on.

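So the problem is that sbatch does not print just the generated job ID on the first line of stdout; it prints Submitted batch job [id]. However, we can work around this with our own shell script: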

parseJobID.sh:

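#!/bin/bash
# Helper script that parses SLURM output for job IDs
# and feeds them back to snakemake/sbatch as a dependency flag.
# This is required when you want to use the snakemake --immediate-submit option.

# sbatch prints "Submitted batch job <id>", so {dependencies} arrives as that
# text repeated once per upstream job. Keep only the numeric IDs, comma-separated.
deplist=$(grep -Eo '[0-9]{1,10}' <<< "$*" | tr '\n' ',' | sed 's/,$//')

if [[ -z "$deplist" ]]; then
  # No upstream jobs: print nothing so sbatch gets no dependency flag.
  echo -n ""
else
  echo -n "--dependency=aftercorr:$deplist"
fi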

Make sure to give the script execute permission with chmod +x parseJobID.sh.
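To see what the helper hands back to sbatch without needing a cluster, the same parsing idea can be wrapped in a function and fed sample input. This is only a sketch: the function name parse_job_ids and the job IDs 1001 and 1002 are made up for illustration.

```shell
#!/bin/bash
# Sketch: the parsing idea from parseJobID.sh wrapped in a function
# (parse_job_ids is a hypothetical name) so it can be tried locally.
parse_job_ids() {
  # Keep only the runs of digits (the job IDs) and join them with commas.
  deplist=$(printf '%s\n' "$*" | grep -Eo '[0-9]{1,10}' | tr '\n' ',' | sed 's/,$//')
  if [ -n "$deplist" ]; then
    printf '%s' "--dependency=aftercorr:$deplist"
  fi
}

# A job with no upstream jobs: {dependencies} is empty, so nothing is printed.
parse_job_ids ""

# A job with two upstream jobs, as sbatch would have reported them.
parse_job_ids "Submitted batch job 1001 Submitted batch job 1002"; echo
# prints: --dependency=aftercorr:1001,1002
```

The empty case matters because the first jobs in the DAG have no dependencies, and sbatch should then receive no dependency flag at all.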

Then we can call immediate submit like this:

snakemake --cluster 'sbatch $(./parseJobID.sh {dependencies})' --jobs 100 --notemp --immediate-submit

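Note that this will submit at most 100 jobs at the same time. You can increase or decrease this to any number you like, but be aware that most cluster systems do not allow more than 1000 jobs per user at once.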
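A possible variation, not part of the answer above: sbatch has a --parsable flag that prints only the job ID, so {dependencies} would then contain bare IDs rather than the Submitted batch job text. The sketch below assumes a single-cluster setup (where --parsable prints just the number) and uses afterok as the dependency type, which is an assumption; build_dependency_flag is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch, assuming sbatch is called with --parsable so that the arguments
# are bare job IDs such as: 1001 1002
# build_dependency_flag is a hypothetical name.
build_dependency_flag() {
  ids=""
  for id in "$@"; do
    # Join IDs with ':' (the separator used in --dependency=afterok:id[:id...]).
    ids="${ids:+$ids:}$id"
  done
  # Print nothing when there are no upstream jobs.
  if [ -n "$ids" ]; then
    printf '%s' "--dependency=afterok:$ids"
  fi
}

build_dependency_flag 1001 1002; echo
# prints: --dependency=afterok:1001:1002
```

Saved as a script (say buildDep.sh, also a hypothetical name), it might be used as snakemake --cluster 'sbatch --parsable $(./buildDep.sh {dependencies})' --jobs 100 --notemp --immediate-submit. This variant has not been tested on a real cluster here.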
