Modify a Snakefile to run multiple iterations of one workflow

Question

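I have a Snakemake workflow with a single Snakefile and a single config file. In the Snakefile I specify one job by its number; the job numbers are not sequential (e.g. 210, 215). For every job I can specify, the config file has a corresponding entry holding information about that particular job (parameters such as the year, the number of subjobs, file prefixes and so on, all stored as strings). In the rules, I build the inputs and outputs with expressions such as config[job]["year"], so that each job gets the correct strings.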

A simplified example of my workflow, which hopefully shows how I use the information from the config file:

# SNAKEFILE
job=210
rule all:
    input:
        expand(config["outputdir"]+"/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root",sample=config[job]["samples"])
...other rules...
rule filter_2:
    input:
        config["outputdir"]+"/filter-1-applied/{sj}/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root"
    output:
        config["outputdir"]+"/filter-2-applied/{sj}/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root"
    shell:
        "(bash scripts/filter-2.sh {input} {output}) 2> {log}"
...other rules...

# CONFIG.YAML
outputdir: "/home/ghl/outputs"
210:
    prefix: "Real"
    year: "2016"
    origindir: "/home/ghl/files/210"
    subjobs: 2653
    originID: "_abc123"
    samples: ["type1_v1","type1_v2","type2_v1","type2_v2"]

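This was fine when I had only a few jobs, but I now have about 80 to run, and some take several hours even when submitted to the batch system I have access to. Running each one by hand, waiting, changing the 'job' variable and running again takes a long time. What I would like is to run several jobs (e.g. 210 and 215) from a single run of this Snakefile.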

In Python, I would wrap everything in a loop, for example:

for job in [1,3,...,210,215]:
    <run single job workflow>
print("Done!")

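I am trying to do the same thing in my Snakefile. I have tried putting job=jobs into the input of 'rule all', as I do for the samples, with jobs=[210,215] defined, and I have tried changing the input to a function that returns the appropriate file names from the list of jobs. Both attempts run into problems related to the fact that 'job' is then no longer a Python variable in the script but a wildcard, and it is not clear to me how I should supply a wildcard to something like config[job]["year"]: neither config[{job}]["year"] nor config["{job}"] works (specifically, they give a NameError or a KeyError).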
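
The two errors can be reproduced in plain Python, outside Snakemake. This is only a sketch of one plausible reading of the failure: the `config` dict is a stand-in for the parsed config file, and the helper `outcome` is hypothetical.

```python
# Stand-in for the parsed config file (not the real one).
config = {210: {"year": "2016"}}

def outcome(lookup):
    """Run a lookup; return its value, or the name of the exception it raised."""
    try:
        return lookup()
    except (KeyError, NameError) as exc:
        return type(exc).__name__

# "{job}" inside an ordinary string is just text, so the key does not exist.
literal_key = outcome(lambda: config["{job}"]["year"])

# Without quotes, {job} is a Python set literal that needs a variable named job;
# no such variable is defined here, so evaluating it fails.
bare_name = outcome(lambda: config[{job}]["year"])

# A lookup with a real Python value as the key works.
working = outcome(lambda: config[210]["year"])

print(literal_key, bare_name, working)
```

In both failing cases the expression is evaluated by Python when the Snakefile is parsed, before any wildcard has a value.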

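Is there a way to achieve this, ideally without a complete rewrite? A modification along the lines I described would be ideal (or perhaps somehow running this workflow from a separate Snakefile that modifies it?). I imagine it could be done just by replacing every instance of config[job] with something that works and changing the input of 'rule all' to include the list of job numbers...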

Thanks in advance!

Answer 1

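In case anyone else is wondering how I solved this: it required rewriting a few things and fairly extensive use of lambda functions. In addition, all files are now prefixed with their job number (I have a bash script that runs outside Snakemake to strip the prefixes). I'm sure most of this goes beyond what was strictly needed, but it is good enough for me.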

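I specified a list of jobs in the config: jobs: [j210,j215] (the j prefix is required, because I get key errors if Snakemake interprets them as integers rather than strings, for reasons I don't fully understand).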
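
A likely explanation for those key errors, shown as a plain-Python sketch: a YAML parser turns an unquoted key such as 210 into an integer, while a wildcard value taken from a file name is always a string. The two dicts below are stand-ins for the parsed config, not the real file.

```python
# Stand-in for a config whose job key was written as 210 in YAML (parsed as int).
config_int_keys = {210: {"year": "2016"}}

# Stand-in for a config whose job key was written as j210 (parsed as str).
config_str_keys = {"j210": {"year": "2016"}}

# Wildcard values are extracted from file names, so they are strings.
wildcard_job = "210"

int_key_found = wildcard_job in config_int_keys       # "210" is not the int 210
converted_found = int(wildcard_job) in config_int_keys
prefixed_found = "j210" in config_str_keys

print(int_key_found, converted_found, prefixed_found)
```

Under this reading, quoting the keys in the YAML file or converting with int(wildcards.job) would be alternatives to the j prefix.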

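I added an extra rule, make_final, that depends only on the job, and modified all the other rules (also using a lot of wildcard constraints; your need for them may vary). This makes job a wildcard, so config[job] can be accessed in input or params through a lambda function: config[wildcards.job]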
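
On the wildcard constraints: once a file name carries several wildcards, for example a pattern like {job}_{prefix}_test_{year}{originID}_{sample}.root, an unconstrained match can split the name in unintended places. The sketch below imitates this with Python's `re` module; `pattern_to_regex` is a hypothetical, simplified helper (not Snakemake's API), and the constraint regexes are assumptions chosen for illustration.

```python
import re

PATTERN = "{job}_{prefix}_test_{year}{originID}_{sample}.root"

def pattern_to_regex(pattern, constraints=None):
    """Turn a {wildcard} pattern into a regex with one named group per wildcard.

    Wildcards without a constraint match '.+', i.e. any non-empty text.
    """
    constraints = constraints or {}
    parts, pos = [], 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text
        name = m.group(1)
        parts.append(f"(?P<{name}>{constraints.get(name, '.+')})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))

filename = "j210_Real_test_2016_abc123_type1_v1.root"

# No constraints: every wildcard may swallow underscores and digits.
loose = pattern_to_regex(PATTERN).fullmatch(filename).groupdict()

# Hypothetical constraints pinning down the shape of each field.
CONSTRAINTS = {
    "job": r"j\d+",
    "prefix": r"[A-Za-z]+",
    "year": r"\d{4}",
    "originID": r"_[A-Za-z0-9]+",
}
strict = pattern_to_regex(PATTERN, CONSTRAINTS).fullmatch(filename).groupdict()

print(loose)
print(strict)
```

In a Snakefile, the corresponding regexes would go into a wildcard_constraints section.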

rule all:
    input:
        expand("completed/{job}.log",job=config["jobs"])

rule make_final:
    # this input is just my final file from the chain of rules. Awkward syntax as requires a list expansion - each source job produces 4 output files
    input:
        lambda wildcards : [(config["outputdir"]+"/{job}_"+config[wildcards.job]["prefix"]+"_test_"+config[wildcards.job]["year"]+config[wildcards.job]["originID"]+"_"+foobar+".root") for foobar in config[wildcards.job]["samples"]],
    output:
        "completed/{job}.log"
    shell:
        "touch {output}"

I also modified the earlier rules, for example like this:

rule filter_2_mc:
    input:
        # this new approach allows neater/more natural phrasing here, rather than
        # using lots of config[job]["blah"] statements
        config["outputdir"]+"/filter-1-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"
    output:
        config["outputdir"]+"/filter-2-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"                                                                                                                       
    shell:
        "bash scripts/filter-2-new.sh {input} {output}"

Some rules need a lambda function in input: or params: if they have to look up anything from config[wildcards.job].
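
A minimal plain-Python sketch of such a lookup, written as a named function instead of a lambda. The assumptions are that the config entries are keyed by the j-prefixed job names and that `SimpleNamespace` stands in for the wildcards object Snakemake would pass; `final_files` is a hypothetical name.

```python
from types import SimpleNamespace

# Stand-in for the parsed config, keyed by the j-prefixed job name (assumption).
config = {
    "outputdir": "/home/ghl/outputs",
    "j210": {
        "prefix": "Real",
        "year": "2016",
        "originID": "_abc123",
        "samples": ["type1_v1", "type1_v2", "type2_v1", "type2_v2"],
    },
}

def final_files(wildcards):
    """Return one final .root path per sample of the job named by wildcards.job."""
    job_cfg = config[wildcards.job]
    stem = (
        f"{config['outputdir']}/{wildcards.job}_{job_cfg['prefix']}"
        f"_test_{job_cfg['year']}{job_cfg['originID']}"
    )
    return [f"{stem}_{sample}.root" for sample in job_cfg["samples"]]

# Simulate the call Snakemake would make for the job j210.
paths = final_files(SimpleNamespace(job="j210"))
for path in paths:
    print(path)
```

In a rule, a function like this would be passed by name (input: final_files), and a value for params: can be looked up the same way.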

(My apologies, too, if answering my own question and marking it as the accepted answer is not allowed.)
