Snakemake: MissingInputException - Missing input files for rule all

Question

I'm trying to create a snakemake workflow for the WhatsHap haplotype caller, but I'm struggling with a MissingInputException error. This is what I get:

Building DAG of jobs...
MissingInputException in line 9 of /srv/KLN/users/esv/KRISPS/whatshap/phased_illumina_FILT5/snakefile:
Missing input files for rule all:
saturna/saturna_phased_illumina_FILT5.vcf.gz
stratos/stratos_phased_illumina_FILT5.vcf.gz
ydun/ydun_phased_illumina_FILT5.vcf.gz
12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz
verdi/verdi_phased_illumina_FILT5.vcf.gz
wotan/wotan_phased_illumina_FILT5.vcf.gz
avarna/avarna_phased_illumina_FILT5.vcf.gz
avenue/avenue_phased_illumina_FILT5.vcf.gz
seresta/seresta_phased_illumina_FILT5.vcf.gz
15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz

If I remove "rule all" and try to generate a single file instead, I get this error:

Building DAG of jobs...
MissingRuleException:
No rule to produce 12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz (if you use input functions make sure that they don't raise unexpected exceptions).

What am I missing? I'm new to snakemake, so maybe (hopefully) it's just a basic mistake. Here is my code:

shell.prefix("module load whatshap/1.1-foss-2020b-Python-3.8.6; module load BCFtools/1.11-GCC-10.2.0; ")

reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= ['chr01','chr02','chr03','chr04','chr05','chr06','chr07','chr08','chr09','chr10','chr11','chr12']

rule all:
    input:  
        "12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz",
        "15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz",
        "avarna/avarna_phased_illumina_FILT5.vcf.gz",
        "avenue/avenue_phased_illumina_FILT5.vcf.gz",
        "Kuras/Kuras_phased_illumina_FILT5.vcf.gz",
        "saturna/saturna_phased_illumina_FILT5.vcf.gz",
        "seresta/seresta_phased_illumina_FILT5.vcf.gz",
        "stratos/stratos_phased_illumina_FILT5.vcf.gz",
        "verdi/verdi_phased_illumina_FILT5.vcf.gz",
        "wotan/wotan_phased_illumina_FILT5.vcf.gz",
        "ydun/ydun_phased_illumina_FILT5.vcf.gz"


rule HaplotypeCalling:
    input:
        reference = reference,
        vcf = "/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf",
        bam = "/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam"
    output:
        "temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    params:
        chroms = chroms
    shell:
        "whatshap polyphase --ploidy 4 -o {output} --reference {input.reference} {input.vcf} {input.bam} --sample {wildcards.sample} --chromosome {params.chroms}"


rule SplitVCF:
    input:
        "temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    shell:
        "bcftools view -s {wildcards.sample} -o {output} {input}"
        

rule ConcatVCF:
    input:
        expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"
    shell:
        "bcftools concat {input} -o {output}"


rule GZipVCF:
    input:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"    
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
    shell:
        "bgzip -c {input} > {output}"

Edit: Assuming I only had two chromosomes (chr01 and chr02), these are the commands I would expect for each sample (the example is for sample ydun):

#rule HaplotypeCalling
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr01.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr01
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr02.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr02.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr02

#rule SplitVCF
bcftools view -s ydun -o ydun/ydun_phased_illumina_FILT5_chr01.vcf temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf

#rule ConcatVCF
bcftools concat ydun/ydun_phased_illumina_FILT5_chr01.vcf ydun/ydun_phased_illumina_FILT5_chr02.vcf -o ydun/ydun_phased_illumina_FILT5.vcf

#rule GZipVCF
bgzip -c ydun/ydun_phased_illumina_FILT5.vcf > ydun/ydun_phased_illumina_FILT5.vcf.gz

Answer 1

Troy Comi has already answered your question in the comments, but I'll explain it further.

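Indeed, removing the double braces will help. The difference between single and double braces is that double braces escape the characters '{' and '}'. In other words, whenever Snakemake encounters a string like "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz" in an output section, it treats {sample} as a wildcard and {{sample}} as the literal string "{sample}". So it tries to find files like {sample}/saturna_phased_illumina_FILT5.vcf.gz, which it certainly cannot find.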
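To make the matching failure concrete, here is a rough Python model of how an output pattern is compared with a requested file. The helper names pattern_to_regex and match_target are hypothetical, and this is a deliberately simplified sketch, not Snakemake's actual implementation.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'dir/{name}_x' into a regex (simplified model of wildcard matching).

    The first use of a name becomes a named group; later uses become
    backreferences, so every occurrence must have the same value.
    """
    seen = set()
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        name = m.group(1)
        if name in seen:
            parts.append(f"(?P={name})")
        else:
            parts.append(f"(?P<{name}>.+)")
            seen.add(name)
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))

def match_target(pattern, target):
    """Return the wildcard values if target fits pattern, else None."""
    m = pattern_to_regex(pattern).fullmatch(target)
    return m.groupdict() if m else None

target = "ydun/ydun_phased_illumina_FILT5.vcf.gz"
single = "{sample}/{sample}_phased_illumina_FILT5.vcf.gz"
doubled = "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"

print(match_target(single, target))   # wildcard values for the single-brace pattern
print(match_target(doubled, target))  # extra braces are literal characters in this model
```

In this model the single-brace pattern yields sample = "ydun", while the double-brace pattern matches nothing, which is the situation a MissingRuleException reports.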

When this string is used in the expand function, the problem is quite different:

expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)

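Here the function replaces {sample} and {chrom} with the values from the lists you pass as the samples and chroms arguments. {{sample}} is converted to the string "{sample}", but that is not the end of the story: the resulting "{sample}" is then treated as the wildcard {sample}. For example, consider this rule: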

samples = ['12NAE3', '15NOH7']
chroms = ['chr01','chr02']

rule ConcatVCF:
    input:
        expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"

This rule is equivalent to the following one, where {wildcard} is an actual wildcard (so its name doesn't matter) and {{sample}} is the string "{sample}":

rule ConcatVCF:
    input:
        "{wildcard}/12NAE3_phased_illumina_FILT5_chr01.vcf",
        "{wildcard}/12NAE3_phased_illumina_FILT5_chr02.vcf",
        "{wildcard}/15NOH7_phased_illumina_FILT5_chr01.vcf",
        "{wildcard}/15NOH7_phased_illumina_FILT5_chr02.vcf"
    output:
        "{{sample}}/{wildcard}_phased_illumina_FILT5.vcf"

That is definitely not what you meant. By removing the double braces you make this rule equivalent to:

rule ConcatVCF:
    input:
        "12NAE3/12NAE3_phased_illumina_FILT5_chr01.vcf",
        "12NAE3/12NAE3_phased_illumina_FILT5_chr02.vcf",
        "15NOH7/15NOH7_phased_illumina_FILT5_chr01.vcf",
        "15NOH7/15NOH7_phased_illumina_FILT5_chr02.vcf"
    output:
        "{wildcard}/{wildcard}_phased_illumina_FILT5.vcf"
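The two expansions above can be reproduced in plain Python. The function expand_sketch below is a hypothetical stand-in for snakemake's expand, written with str.format, which follows the same brace-escaping convention; it is only meant to show where the leftover "{sample}" comes from.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for expand(): fill the pattern with every
    combination of the given values, using str.format brace rules."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

samples = ["12NAE3", "15NOH7"]
chroms = ["chr01", "chr02"]

doubled = expand_sketch(
    "{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf",
    sample=samples, chrom=chroms,
)
single = expand_sketch(
    "{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf",
    sample=samples, chrom=chroms,
)

print(doubled[0])  # {sample}/12NAE3_phased_illumina_FILT5_chr01.vcf
print(single[0])   # 12NAE3/12NAE3_phased_illumina_FILT5_chr01.vcf
```

Note that in both cases every sample in the list is expanded into the input, so each job of the rule would depend on the files of all samples, not only its own.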

Answer 2

Based on your expected commands, here is my attempt at refactoring your workflow. I've added comments to explain the changes.

# moved shell prefix to the envmodules directive
# https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules
# that's better if you need a different set of modules later
# (envmodules only take effect when snakemake is run with --use-envmodules)

reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= [f'chr{chrom:02}' for chrom in range(1, 13)]

# place filenames together or (better) in a config.yaml
phased_output = '{sample}/{sample}_phased_illumina_FILT5.vcf.gz'
variants = '/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf'
sample_bam = '/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam'
temp_vcf = 'temp/{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
split_vcf = '{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'


# don't repeat yourself with sample names or phased output
# using expand makes it easier to add new samples later or change your
# filename
rule all:
    input:  
        expand(phased_output, sample=samples)

# using variables for filenames makes rules clearer
# Based on your sample, I think you should pass in wildcards.chrom
# instead of the chroms list
# For shell, I like to format the calls with an option per line
# it makes it easier to see all options and change or remove them.
# snakemake will combine all the lines so note the spaces at the end of
# each line!
rule HaplotypeCalling:
    input:
        reference = reference,
        vcf = variants,
        bam = sample_bam,
    output:
        temp_vcf
    envmodules:
        'whatshap/1.1-foss-2020b-Python-3.8.6'
    shell:
        "whatshap polyphase "
            "--ploidy 4 "
            "-o {output} "
            "--sample {wildcards.sample} "
            "--chromosome {wildcards.chrom} "
            "--reference {input.reference} {input.vcf} {input.bam} "

# options before arguments
rule SplitVCF:
    input:
        temp_vcf
    output:
        split_vcf
    envmodules:
        'BCFtools/1.11-GCC-10.2.0'
    shell:
        "bcftools view "
            "-s {wildcards.sample} "
            "-o {output} "
            "{input}"
        
# use the output type option of bcftools to skip the bgzip step
rule ConcatVCF:
    input:
        expand(split_vcf, chrom=chroms, allow_missing=True)
    output:
        phased_output
    envmodules:
        'BCFtools/1.11-GCC-10.2.0'
    shell:
        "bcftools concat "
            "-o {output} "
            "-O z "  # compressed vcf output
            "{input} "

Not tested, but that's my first pass at it!
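On the note about trailing spaces: the multi-line shell strings rely on Python joining adjacent string literals, so a forgotten space silently glues two options together. A small sketch with placeholder values (ref.fa and the values dict are made up for illustration):

```python
# Adjacent string literals are concatenated by Python before anything
# else sees them, so each piece needs its own trailing space.
with_spaces = (
    "whatshap polyphase "
    "--ploidy 4 "
    "--chromosome {chrom} "
    "--reference {reference}"
)

without_space = (
    "whatshap polyphase "
    "--ploidy 4 "
    "--chromosome {chrom}"      # no trailing space here
    "--reference {reference}"
)

values = {"chrom": "chr01", "reference": "ref.fa"}  # placeholder values

good = with_spaces.format(**values)
bad = without_space.format(**values)

print(good)  # whatshap polyphase --ploidy 4 --chromosome chr01 --reference ref.fa
print(bad)   # whatshap polyphase --ploidy 4 --chromosome chr01--reference ref.fa
```

Running snakemake with -n -p (dry run, print shell commands) is a quick way to spot this kind of mistake before any job starts.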

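Take some time to browse the rules page in the documentation and work through the tutorial. Wildcards are important but very subtle. Here is some material from a workshop I gave. It's a bit dated, but the core material is still good.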
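As one example of that subtlety, the allow_missing argument of expand in ConcatVCF fills in only the chromosomes and leaves {sample} as a wildcard to be resolved per job. The sketch below imitates that behaviour in plain Python; partial_expand and KeepUnfilled are hypothetical names, not part of snakemake.

```python
from itertools import product

class KeepUnfilled(dict):
    """Mapping that returns '{name}' for any field that was not supplied."""
    def __missing__(self, key):
        return "{" + key + "}"

def partial_expand(pattern, **wildcards):
    """Fill only the given wildcards; leave all others in place."""
    names = list(wildcards)
    return [
        pattern.format_map(KeepUnfilled(zip(names, combo)))
        for combo in product(*wildcards.values())
    ]

split_vcf = "{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
inputs = partial_expand(split_vcf, chrom=["chr01", "chr02"])

for path in inputs:
    print(path)
# {sample}/{sample}_phased_illumina_FILT5_chr01.vcf
# {sample}/{sample}_phased_illumina_FILT5_chr02.vcf
```

Each resulting pattern still contains {sample}, so the rule collects all chromosomes of one sample rather than the files of every sample at once.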
