snakemake - 为没有通配符的聚合规则定义输入

snakemake - define input for aggregate rule without wildcards

我正在编写一个 snakemake 来从纳米孔测序中生成 Sars-Cov-2 变体。我正在编写的管道基于 artic network,因此我使用 artic guppyplexartic minion.

我写的snakemake有以下步骤:

  1. 压缩所有条形码的所有 fastq 文件(规则 zipFq
  2. 使用 guppyplex 执行读取过滤(规则 guppyplex
  3. 调用 artic minion 管道(规则 minion
  4. 将 stderr 和 stdout 从 qsub 移动到工作目录下的文件夹(规则 mvQsubLogs

下面是我目前写的 snakemake,可以正常工作

barcodes = ['barcode49', 'barcode50', 'barcode51']

rule all:
    input:
        expand([
            # zip fq
            "zipFastq/{barcode}/{barcode}.zip",

            # guppyplex
            "guppyplex/{barcode}/{barcode}.fastq",

            # nanopolish
            "nanopolish/{barcode}",

            # directory where the logs will be moved to    
            "logs/{barcode}"
        ], barcode = barcodes)

rule zipFq:
    input: 
        FQ = f"{FASTQ_PATH}/{{barcode}}"
    output:
        "zipFastq/{barcode}/{barcode}.zip"
    shell:
        "zip {output} {input.FQ}/*"


rule guppyplex:
    input:
        FQ = f"{FASTQ_PATH}/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "/home/ngs/miniconda3/envs/artic-ncov2019/bin/artic guppyplex --skip-quality-check --min-length {MINLENGTHGUPPY} --max-length {MAXLENGTHGUPPY} --directory {input.FQ} --prefix {wildcards.barcode} --output {output}" # variables in CAPITALS are parsed from config.yaml


rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"{FAST5_PATH}/{{barcode}}"
    params:
        OUTDIR = "nanopolish/{barcode}"
    output:
        directory("nanopolish/{barcode}")
    shell:
        """
        mkdir {params.OUTDIR};
        cd {params.OUTDIR};
        export PATH=/home/ngs/miniconda3/envs/artic-ncov2019/bin:$PATH;
        artic minion --normalise {NANOPOLISH_NORMALISE} --threads {THREADS} --scheme-directory {PRIMERSDIR} --read-file ../../{input.INFQ} --sequencing-summary {Seq_Sum} --fast5-directory {input.FAST5}  nCoV-2019/{PRIMERVERSION} {wildcards.barcode} # variables in CAPITALS are parsed from config.yaml
        """

rule mvQsubLogs:
    input:
        # zipFQ
        rules.zipFq.output,

        # guppyplex
        rules.guppyplex.output,

        # nanopolish
        rules.minion.output
    output:
        directory("logs/{barcode}")
    shell:
        "mkdir -p {output} \n"
        "mv {LOGDIR}/{wildcards.barcode}* {output}/"

上面的 snakemake 有效,现在我正在尝试添加另一个规则,但这里的区别是这个规则是一个聚合函数,即它不应该为每个 barcode 调用,但毕竟只调用一次所有条形码都调用了规则

我试图合并的规则 (catFasta) 会将 cat 所有 {barcode}.consensus.fasta (由规则 minion 生成)放入单个文件中,如下所示(合并到上面的 snakemake):

barcodes = ['barcode49', 'barcode50', 'barcode51']

rule all:
    input:
        expand([
            # zip fq
            "zipFastq/{barcode}/{barcode}.zip",

            # guppyplex
            "guppyplex/{barcode}/{barcode}.fastq",

            # nanopolish
            "nanopolish/{barcode}",
            
            # catFasta
            "catFasta/cat_consensus.fasta",

            # directory where the logs will be moved to    
            "logs/{barcode}"
        ], barcode = barcodes)

rule zipFq:
    input: 
        FQ = f"{FASTQ_PATH}/{{barcode}}"
    output:
        "zipFastq/{barcode}/{barcode}.zip"
    shell:
        "zip {output} {input.FQ}/*"


rule guppyplex:
    input:
        FQ = f"{FASTQ_PATH}/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "/home/ngs/miniconda3/envs/artic-ncov2019/bin/artic guppyplex --skip-quality-check --min-length {MINLENGTHGUPPY} --max-length {MAXLENGTHGUPPY} --directory {input.FQ} --prefix {wildcards.barcode} --output {output}" # variables in CAPITALS are parsed from config.yaml


rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"{FAST5_PATH}/{{barcode}}"
    params:
        OUTDIR = "nanopolish/{barcode}"
    output:
        directory("nanopolish/{barcode}")
    shell:
        """
        mkdir {params.OUTDIR};
        cd {params.OUTDIR};
        export PATH=/home/ngs/miniconda3/envs/artic-ncov2019/bin:$PATH;
        artic minion --normalise {NANOPOLISH_NORMALISE} --threads {THREADS} --scheme-directory {PRIMERSDIR} --read-file ../../{input.INFQ} --sequencing-summary {Seq_Sum} --fast5-directory {input.FAST5}  nCoV-2019/{PRIMERVERSION} {wildcards.barcode} # variables in CAPITALS are parsed from config.yaml
        """

rule catFasta:
    input:
        expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"

rule mvQsubLogs:
    input:
        # zipFQ
        rules.zipFq.output,

        # guppyplex
        rules.guppyplex.output,

        # nanopolish
        rules.minion.output,

        # catFasta
        rules.catFasta.output
    output:
        directory("logs/{barcode}")
    shell:
        "mkdir -p {output} \n"
        "mv {LOGDIR}/{wildcards.barcode}* {output}/"

然而,当我用

调用 snakemake 时
(artic-ncov2019) ngs@bngs05b:/nexusb/SC2/ONT/scripts/SnakeMake> snakemake -np -s Snakefile_v2 --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" all # LOGDIR parsed from config.yaml

我得到:

Building DAG of jobs...
MissingInputException in line 178 of /nexusb/SC2/ONT/scripts/SnakeMake/Snakefile_v2:
Missing input files for rule guppyplex:
/nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/barcode49/barcode49.consensus.fasta

我觉得这不容易理解:snakemake 抱怨 /nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/barcode49/barcode49.consensus.fasta/nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/FASTQ_PATH 而我没有在任何地方定义 f"{FASTQ_PATH}/{{barcode}}.consensus.fasta"

描述了一个完全相同的问题 here,尽管接受的答案中的策略(规则 catFasta 的输入将是 expand("nanopolish/{{barcode}}/{{barcode}}.consensus.fasta"))对我不起作用。

有谁知道我该如何规避这个问题?

失败的规则是 rule guppyplex,它查找 {FASTQ_PATH}/{{barcode}}.

形式的输入

看起来通配符 {barcode} 填充了 barcode49/barcode49.consensus.fasta,我认为这是由于两个原因造成的:

首先(也是最重要的):工作流没有找到更好的方法来产生最终输出。在 rule catFasta 中,您提供了一个输入文件,该文件在您的工作流程中从未被描述为输出。 rule minion 将目录作为输出,但没有文件,并且对于生成此输入文件的工作流程并不完全清楚。

因此推断 {barcode} 通配符必须以某种方式包含它以前从未见过的 .consensus.fasta。这个通配符然后被移交给顶部,工作流在那里崩溃,因为它找不到匹配的输入文件。

其二:这个通配符的初始化with sth。你不想要的是唯一可能的,因为你没有正确约束通配符。例如,您可以禁止通配符包含 .(参见 wildcard_constraints here

但是,主要的问题是catFasta没有找到想要的输入。我建议将 minion 的输出更改为 "nanopolish/{barcode}/{barcode}.consensus.fasta",因为您已经从参数中获取了 OUTDIR,这不会影响您的规则。

编辑:虚拟测试示例:

barcodes = ['barcode49', 'barcode50', 'barcode51']

rule all:
    input:
        expand([
            # guppyplex
            "guppyplex/{barcode}/{barcode}.fastq",

            # catFasta
            "catFasta/cat_consensus.fasta",
        ], barcode = barcodes)


rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "touch {output}" # variables in CAPITALS are parsed from config.yaml


rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"fasta/{{barcode}}"
    params:
        OUTDIR = "nanopolish/{barcode}"
    output:
        "nanopolish/{barcode}/{barcode}.consensus.fasta"
    shell:
        """
        touch {output} && echo {wildcards.barcode} > {output}
        """

rule catFasta:
    input:
        expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"