Snakemake:MissingInputException - 缺少规则所有的输入文件

Snakemake: MissingInputException - Missing input files for rule all

我正在尝试为 whatshap 单倍型调用者创建一个 snakemake 工作流程,但我正在为 MissingInputException 错误而苦苦挣扎。这是我得到的:

Building DAG of jobs...
MissingInputException in line 9 of /srv/KLN/users/esv/KRISPS/whatshap/phased_illumina_FILT5/snakefile:
Missing input files for rule all:
saturna/saturna_phased_illumina_FILT5.vcf.gz
stratos/stratos_phased_illumina_FILT5.vcf.gz
ydun/ydun_phased_illumina_FILT5.vcf.gz
12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz
verdi/verdi_phased_illumina_FILT5.vcf.gz
wotan/wotan_phased_illumina_FILT5.vcf.gz
avarna/avarna_phased_illumina_FILT5.vcf.gz
avenue/avenue_phased_illumina_FILT5.vcf.gz
seresta/seresta_phased_illumina_FILT5.vcf.gz
15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz

如果我删除“全部规则”并尝试生成单个文件,我会收到此错误:

Building DAG of jobs...
MissingRuleException:
No rule to produce 12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz (if you use input functions make sure that they don't raise unexpected exceptions).

我错过了什么?我是 snakemake 的新手,所以也许(希望如此)这只是一个基本错误。这是我的代码:

shell.prefix("module load whatshap/1.1-foss-2020b-Python-3.8.6; module load BCFtools/1.11-GCC-10.2.0; ")

reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= ['chr01','chr02','chr03','chr04','chr05','chr06','chr07','chr08','chr09','chr10','chr11','chr12']

rule all:
    input:  
        "12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz",
        "15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz",
        "avarna/avarna_phased_illumina_FILT5.vcf.gz",
        "avenue/avenue_phased_illumina_FILT5.vcf.gz",
        "Kuras/Kuras_phased_illumina_FILT5.vcf.gz",
        "saturna/saturna_phased_illumina_FILT5.vcf.gz",
        "seresta/seresta_phased_illumina_FILT5.vcf.gz",
        "stratos/stratos_phased_illumina_FILT5.vcf.gz",
        "verdi/verdi_phased_illumina_FILT5.vcf.gz",
        "wotan/wotan_phased_illumina_FILT5.vcf.gz",
        "ydun/ydun_phased_illumina_FILT5.vcf.gz"


rule HaplotypeCalling:
    input:
        reference = reference,
        vcf = "/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf",
        bam = "/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam"
    output:
        "temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    params:
        chroms = chroms
    shell:
        "whatshap polyphase --ploidy 4 -o {output} --reference {input.reference} {input.vcf} {input.bam} --sample {wildcards.sample} --chromosome {params.chroms}"


rule SplitVCF:
    input:
        "temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    shell:
        "bcftools view -s {wildcards.sample} -o {output} {input}"
        

rule ConcatVCF:
    input:
        expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"
    shell:
        "bcftools concat {input} -o {output}"


rule GZipVCF:
    input:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"    
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
    shell:
        "bgzip -c {input} > {output}"

编辑:假设我只有两条染色体(chr01 和 chr02)(示例是针对样本 ydun),我期望每个样本的命令是这些:

#rule HaplotypeCalling
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr01.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr01
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr02.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr02.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr02

#rule SplitVCF
bcftools view -s ydun -o ydun/ydun_phased_illumina_FILT5_chr01.vcf temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf

#rule ConcatVCF
bcftools concat ydun/ydun_phased_illumina_FILT5_chr01.vcf ydun/ydun_phased_illumina_FILT5_chr02.vcf -o ydun/ydun_phased_illumina_FILT5.vcf

#rule GZipVCF
bgzip -c ydun/ydun_phased_illumina_FILT5.vcf > ydun/ydun_phased_illumina_FILT5.vcf.gz

Troy Comi 已经在评论中回答了你的问题,但我会进一步解释。

确实,去掉双括号会有帮助。单大括号和双大括号的区别在于双大括号对符号 '{''}' 进行了转义。换句话说,每当 Snakemake 在输出部分遇到像 "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz" 这样的字符串时,它会将 {sample} 视为通配符,将 {{sample}} 视为字符串 "{sample}"。所以它会尝试找到它肯定找不到的 {sample}/saturna_phased_illumina_FILT5.vcf.gz 之类的文件。

如果在 expand 函数中使用此字符串,问题就完全不同了:

expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)

此处函数将 {sample}{chrom} 替换为您作为参数 sampleschroms 提供的列表中的值。 {{sample}} 正在转换为字符串 "{sample}",但这还没有结束。此转换结果 "{sample}" 被视为通配符 {sample}。例如,考虑规则:

samples = ['12NAE3', '15NOH7']
chroms = ['chr01','chr02']

rule ConcatVCF:
    input:
        expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"

这条规则等同于这条规则,其中 {wildcard} 是实际的通配符(因此名称无关紧要),{{sample}} 是一个字符串 "{sample}":

rule ConcatVCF:
    input:
        "{wildcard}/12NAE3_phased_illumina_FILT5_chr01.vcf",
        "{wildcard}/12NAE3_phased_illumina_FILT5_chr02.vcf",
        "{wildcard}/15NOH7_phased_illumina_FILT5_chr01.vcf",
        "{wildcard}/15NOH7_phased_illumina_FILT5_chr02.vcf"
    output:
        "{{sample}}/{wildcard}_phased_illumina_FILT5.vcf"

那绝对不是你的意思。删除双括号你使这个规则等同于:

rule ConcatVCF:
    input:
        "12NAE3/12NAE3_phased_illumina_FILT5_chr01.vcf",
        "12NAE3/12NAE3_phased_illumina_FILT5_chr02.vcf",
        "15NOH7/15NOH7_phased_illumina_FILT5_chr01.vcf",
        "15NOH7/15NOH7_phased_illumina_FILT5_chr02.vcf"
    output:
        "{wildcard}/{wildcard}_phased_illumina_FILT5.vcf"

根据您的预期命令,这是我重构您的工作流程的尝试。我已发表评论以解释更改。

# moved shell prefix to envmodules directive
# https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules
# that's better for if you need a different set of modules later

reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= [f'chr{chrom:02}' for chrom in range(1, 13)]

# place filenames together or (better) in a config.yaml
phased_output = '{sample}/{sample}_phased_illumina_FILT5.vcf.gz'
variants = '/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf'
sample_bam = '/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam'
temp_vcf = 'temp/{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
split_vcf = '{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'


# don't repeat yourself with sample names or phased output
# using expand makes it easier to add new samples later or change your
# filename
rule all:
    input:  
        expand(phased_output, sample=samples)

# using variables for filenames makes rules clearer
# Based on your sample, I think you should pass in wildcards.chrom
# instead of the chroms list
# For shell, I like to format the calls with an option per line
# it makes it easier to see all options and change or remove them.
# snakemake will combine all the lines so note the spaces at the end of
# each line!
rule HaplotypeCalling:
    input:
        reference = reference,
        vcf = variants,
        bam = sample_bam,
    output:
        temp_vcf
    envmodules:
        'whatshap/1.1-foss-2020b-Python-3.8.6'
    shell:
        "whatshap polyphase "
            "--ploidy 4 "
            "-o {output} "
            "--sample {wildcards.sample} "
            "--chromosome {wildcards.chroms}"
            "--reference {input.reference} {input.vcf} {input.bam} "

# options before arguments
rule SplitVCF:
    input:
        temp_vcf
    output:
        split_vcf
    envmodules:
        'BCFtools/1.11-GCC-10.2.0'
    shell:
        "bcftools view "
            "-s {wildcards.sample} "
            "-o {output} "
            "{input}"
        
# use output type option of bcf tools to skip the bgzip step
rule ConcatVCF:
    input:
        expand(split_vcf, chrom=chroms, allow_missing=true)
    output:
        phased_output
    envmodules:
        'BCFtools/1.11-GCC-10.2.0'
    shell:
        "bcftools concat "
            "-o {output} "
            "-O z "  # compressed vcf output
            "{input} "

未测试,但这是我第一次通过它!

花一些时间浏览文档中的规则页面并完成本教程。通配符很重要但非常微妙。这是我给的 workshop 的一些 material。它有点过时,但核心 material 仍然不错。