Snakemake runs rule too many times using config.yaml

I am trying to create a Snakemake workflow that evaluates raw read quality with FastQC and creates a report with MultiQC. I used 4 input files and got the expected results, but I just noticed that each rule runs 4 times and takes all 4 inputs each time, and I am not sure how to fix this. Could anyone help me figure out what is wrong?

I have been trying to follow the Snakemake tutorial, but with no luck so far.

Snakefile:

configfile: "config.yaml"

rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])

rule raw_fastqc:
    input:
        expand("data/samples/{sample}.fastq", sample=config["samples"])
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"

rule raw_multiqc:
    input:
        expand("outputs/fastqc_1/{sample}_fastqc.html", sample=config["samples"]),
        expand("outputs/fastqc_1/{sample}_fastqc.zip", sample=config["samples"])
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"

The config.yaml file:

samples:
    Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq
    Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq
    KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq
    KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq

I run Snakemake with the command:

snakemake -s Snakefile --cores 1

Each rule runs 4 times:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count    min threads    max threads
-----------  -------  -------------  -------------
all                1              1              1
raw_fastqc         4              1              1
raw_multiqc        4              1              1
total              9              1              1

and each run takes all 4 inputs:

[Sun May 15 23:06:22 2022]
rule raw_fastqc:
    input: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq, data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
    output: outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.html, outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.zip
    jobid: 3
    wildcards: sample=Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001
    resources: tmpdir=/tmp

Your problem is the use of expand() in the input of every rule. Because expand() fills in the wildcard values itself, the resulting input no longer constrains {sample}, so each job receives all four files. You only need expand() in the all rule; from there the wildcard values propagate to the upstream rules, which should use plain {sample} patterns in their inputs.
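To see why every job received all four files, here is a plain-Python sketch of what expand() does (the sample names are placeholders, not your real files): it substitutes each wildcard value into the pattern and returns the full list of concrete paths. Note also that iterating a dict such as config["samples"] yields its keys, which is why the sample names, not the file paths, end up in {sample}.

```python
# Plain-Python sketch of Snakemake's expand(): substitute each wildcard
# value into the pattern, producing a list of concrete paths.
# Iterating a dict (like config["samples"]) yields its keys.
config = {"samples": {"sampleA": "data/samples/sampleA.fastq",
                      "sampleB": "data/samples/sampleB.fastq"}}

pattern = "outputs/fastqc_1/{sample}_fastqc.html"
targets = [pattern.format(sample=s) for s in config["samples"]]
print(targets)
# → ['outputs/fastqc_1/sampleA_fastqc.html', 'outputs/fastqc_1/sampleB_fastqc.html']
```

An input declared this way is already fully expanded, so Snakemake cannot use it to infer a single {sample} per job; leaving the bare pattern in the rule lets the wildcard do that instead.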

Snakefile:

configfile: "config.yaml"

rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])

rule raw_fastqc:
    input:
        "data/samples/{sample}.fastq"
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"

rule raw_multiqc:
    input:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"