Snakemake:MissingInputException - 缺少规则所有的输入文件
Snakemake: MissingInputException - Missing input files for rule all
我正在尝试为 whatshap 单倍型调用者创建一个 snakemake 工作流程,但我正在为 MissingInputException 错误而苦苦挣扎。这是我得到的:
Building DAG of jobs...
MissingInputException in line 9 of /srv/KLN/users/esv/KRISPS/whatshap/phased_illumina_FILT5/snakefile:
Missing input files for rule all:
saturna/saturna_phased_illumina_FILT5.vcf.gz
stratos/stratos_phased_illumina_FILT5.vcf.gz
ydun/ydun_phased_illumina_FILT5.vcf.gz
12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz
verdi/verdi_phased_illumina_FILT5.vcf.gz
wotan/wotan_phased_illumina_FILT5.vcf.gz
avarna/avarna_phased_illumina_FILT5.vcf.gz
avenue/avenue_phased_illumina_FILT5.vcf.gz
seresta/seresta_phased_illumina_FILT5.vcf.gz
15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz
如果我删除“全部规则”并尝试生成单个文件,我会收到此错误:
Building DAG of jobs...
MissingRuleException:
No rule to produce 12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz (if you use input functions make sure that they don't raise unexpected exceptions).
我错过了什么?我是 snakemake 的新手,所以也许(希望如此)这只是一个基本错误。这是我的代码:
shell.prefix("module load whatshap/1.1-foss-2020b-Python-3.8.6; module load BCFtools/1.11-GCC-10.2.0; ")
reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= ['chr01','chr02','chr03','chr04','chr05','chr06','chr07','chr08','chr09','chr10','chr11','chr12']
rule all:
input:
"12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz",
"15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz",
"avarna/avarna_phased_illumina_FILT5.vcf.gz",
"avenue/avenue_phased_illumina_FILT5.vcf.gz",
"Kuras/Kuras_phased_illumina_FILT5.vcf.gz",
"saturna/saturna_phased_illumina_FILT5.vcf.gz",
"seresta/seresta_phased_illumina_FILT5.vcf.gz",
"stratos/stratos_phased_illumina_FILT5.vcf.gz",
"verdi/verdi_phased_illumina_FILT5.vcf.gz",
"wotan/wotan_phased_illumina_FILT5.vcf.gz",
"ydun/ydun_phased_illumina_FILT5.vcf.gz"
rule HaplotypeCalling:
input:
reference = reference,
vcf = "/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf",
bam = "/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam"
output:
"temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
params:
chroms = chroms
shell:
"whatshap polyphase --ploidy 4 -o {output} --reference {input.reference} {input.vcf} {input.bam} --sample {wildcards.sample} --chromosome {params.chroms}"
rule SplitVCF:
input:
"temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
output:
"{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
shell:
"bcftools view -s {wildcards.sample} -o {output} {input}"
rule ConcatVCF:
input:
expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
output:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf"
shell:
"bcftools concat {input} -o {output}"
rule GZipVCF:
input:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf"
output:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
shell:
"bgzip -c {input} > {output}"
编辑:假设我只有两条染色体(chr01 和 chr02)(示例是针对样本 ydun),我期望每个样本的命令是这些:
#rule HaplotypeCalling
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr01.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr01
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr02.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr02.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr02
#rule SplitVCF
bcftools view -s ydun -o ydun/ydun_phased_illumina_FILT5_chr01.vcf temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf
#rule ConcatVCF
bcftools concat ydun/ydun_phased_illumina_FILT5_chr01.vcf ydun/ydun_phased_illumina_FILT5_chr02.vcf -o ydun/ydun_phased_illumina_FILT5.vcf
#rule GZipVCF
bgzip -c ydun/ydun_phased_illumina_FILT5.vcf > ydun/ydun_phased_illumina_FILT5.vcf.gz
Troy Comi 已经在评论中回答了你的问题,但我会进一步解释。
确实,去掉双括号会有帮助。单大括号和双大括号的区别在于双大括号对符号 '{'
和 '}'
进行了转义。换句话说,每当 Snakemake 在输出部分遇到像 "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
这样的字符串时,它会将 {sample}
视为通配符,将 {{sample}}
视为字符串 "{sample}"
。所以它会尝试找到它肯定找不到的 {sample}/saturna_phased_illumina_FILT5.vcf.gz
之类的文件。
如果在 expand
函数中使用此字符串,问题就完全不同了:
expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
此处函数将 {sample}
和 {chrom}
替换为您作为参数 samples
和 chroms
提供的列表中的值。 {{sample}}
正在转换为字符串 "{sample}"
,但这还没有结束。此转换结果 "{sample}"
被视为通配符 {sample}
。例如,考虑规则:
samples = ['12NAE3', '15NOH7']
chroms = ['chr01','chr02']
rule ConcatVCF:
input:
expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
output:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf"
这条规则等同于这条规则,其中 {wildcard}
是实际的通配符(因此名称无关紧要),{{sample}}
是一个字符串 "{sample}"
:
rule ConcatVCF:
input:
"{wildcard}/12NAE3_phased_illumina_FILT5_chr01.vcf",
"{wildcard}/12NAE3_phased_illumina_FILT5_chr02.vcf",
"{wildcard}/15NOH7_phased_illumina_FILT5_chr01.vcf",
"{wildcard}/15NOH7_phased_illumina_FILT5_chr02.vcf"
output:
"{{sample}}/{wildcard}_phased_illumina_FILT5.vcf"
那绝对不是你的意思。删除双括号你使这个规则等同于:
rule ConcatVCF:
input:
"12NAE3/12NAE3_phased_illumina_FILT5_chr01.vcf",
"12NAE3/12NAE3_phased_illumina_FILT5_chr02.vcf",
"15NOH7/15NOH7_phased_illumina_FILT5_chr01.vcf",
"15NOH7/15NOH7_phased_illumina_FILT5_chr02.vcf"
output:
"{wildcard}/{wildcard}_phased_illumina_FILT5.vcf"
根据您的预期命令,这是我重构您的工作流程的尝试。我已发表评论以解释更改。
# moved shell prefix to envmodules directive
# https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules
# that's better for if you need a different set of modules later
reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= [f'chr{chrom:02}' for chrom in range(1, 13)]
# place filenames together or (better) in a config.yaml
phased_output = '{sample}/{sample}_phased_illumina_FILT5.vcf.gz'
variants = '/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf'
sample_bam = '/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam'
temp_vcf = 'temp/{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
split_vcf = '{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
# don't repeat yourself with sample names or phased output
# using expand makes it easier to add new samples later or change your
# filename
rule all:
input:
expand(phased_output, sample=samples)
# using variables for filenames makes rules clearer
# Based on your sample, I think you should pass in wildcards.chrom
# instead of the chroms list
# For shell, I like to format the calls with an option per line
# it makes it easier to see all options and change or remove them.
# snakemake will combine all the lines so note the spaces at the end of
# each line!
rule HaplotypeCalling:
input:
reference = reference,
vcf = variants,
bam = sample_bam,
output:
temp_vcf
envmodules:
'whatshap/1.1-foss-2020b-Python-3.8.6'
shell:
"whatshap polyphase "
"--ploidy 4 "
"-o {output} "
"--sample {wildcards.sample} "
"--chromosome {wildcards.chroms}"
"--reference {input.reference} {input.vcf} {input.bam} "
# options before arguments
rule SplitVCF:
input:
temp_vcf
output:
split_vcf
envmodules:
'BCFtools/1.11-GCC-10.2.0'
shell:
"bcftools view "
"-s {wildcards.sample} "
"-o {output} "
"{input}"
# use output type option of bcf tools to skip the bgzip step
rule ConcatVCF:
input:
expand(split_vcf, chrom=chroms, allow_missing=true)
output:
phased_output
envmodules:
'BCFtools/1.11-GCC-10.2.0'
shell:
"bcftools concat "
"-o {output} "
"-O z " # compressed vcf output
"{input} "
未测试,但这是我第一次通过它!
花一些时间浏览文档中的规则页面并完成本教程。通配符很重要但非常微妙。这是我给的 workshop 的一些 material。它有点过时,但核心 material 仍然不错。
我正在尝试为 whatshap 单倍型调用者创建一个 snakemake 工作流程,但我正在为 MissingInputException 错误而苦苦挣扎。这是我得到的:
Building DAG of jobs...
MissingInputException in line 9 of /srv/KLN/users/esv/KRISPS/whatshap/phased_illumina_FILT5/snakefile:
Missing input files for rule all:
saturna/saturna_phased_illumina_FILT5.vcf.gz
stratos/stratos_phased_illumina_FILT5.vcf.gz
ydun/ydun_phased_illumina_FILT5.vcf.gz
12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz
verdi/verdi_phased_illumina_FILT5.vcf.gz
wotan/wotan_phased_illumina_FILT5.vcf.gz
avarna/avarna_phased_illumina_FILT5.vcf.gz
avenue/avenue_phased_illumina_FILT5.vcf.gz
seresta/seresta_phased_illumina_FILT5.vcf.gz
15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz
如果我删除“全部规则”并尝试生成单个文件,我会收到此错误:
Building DAG of jobs...
MissingRuleException:
No rule to produce 12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz (if you use input functions make sure that they don't raise unexpected exceptions).
我错过了什么?我是 snakemake 的新手,所以也许(希望如此)这只是一个基本错误。这是我的代码:
shell.prefix("module load whatshap/1.1-foss-2020b-Python-3.8.6; module load BCFtools/1.11-GCC-10.2.0; ")
reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= ['chr01','chr02','chr03','chr04','chr05','chr06','chr07','chr08','chr09','chr10','chr11','chr12']
rule all:
input:
"12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz",
"15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz",
"avarna/avarna_phased_illumina_FILT5.vcf.gz",
"avenue/avenue_phased_illumina_FILT5.vcf.gz",
"Kuras/Kuras_phased_illumina_FILT5.vcf.gz",
"saturna/saturna_phased_illumina_FILT5.vcf.gz",
"seresta/seresta_phased_illumina_FILT5.vcf.gz",
"stratos/stratos_phased_illumina_FILT5.vcf.gz",
"verdi/verdi_phased_illumina_FILT5.vcf.gz",
"wotan/wotan_phased_illumina_FILT5.vcf.gz",
"ydun/ydun_phased_illumina_FILT5.vcf.gz"
rule HaplotypeCalling:
input:
reference = reference,
vcf = "/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf",
bam = "/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam"
output:
"temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
params:
chroms = chroms
shell:
"whatshap polyphase --ploidy 4 -o {output} --reference {input.reference} {input.vcf} {input.bam} --sample {wildcards.sample} --chromosome {params.chroms}"
rule SplitVCF:
input:
"temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
output:
"{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
shell:
"bcftools view -s {wildcards.sample} -o {output} {input}"
rule ConcatVCF:
input:
expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
output:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf"
shell:
"bcftools concat {input} -o {output}"
rule GZipVCF:
input:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf"
output:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
shell:
"bgzip -c {input} > {output}"
编辑:假设我只有两条染色体(chr01 和 chr02)(示例是针对样本 ydun),我期望每个样本的命令是这些:
#rule HaplotypeCalling
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr01.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr01
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr02.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr02.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr02
#rule SplitVCF
bcftools view -s ydun -o ydun/ydun_phased_illumina_FILT5_chr01.vcf temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf
#rule ConcatVCF
bcftools concat ydun/ydun_phased_illumina_FILT5_chr01.vcf ydun/ydun_phased_illumina_FILT5_chr02.vcf -o ydun/ydun_phased_illumina_FILT5.vcf
#rule GZipVCF
bgzip -c ydun/ydun_phased_illumina_FILT5.vcf > ydun/ydun_phased_illumina_FILT5.vcf.gz
Troy Comi 已经在评论中回答了你的问题,但我会进一步解释。
确实,去掉双括号会有帮助。单大括号和双大括号的区别在于双大括号对符号 '{'
和 '}'
进行了转义。换句话说,每当 Snakemake 在输出部分遇到像 "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
这样的字符串时,它会将 {sample}
视为通配符,将 {{sample}}
视为字符串 "{sample}"
。所以它会尝试找到它肯定找不到的 {sample}/saturna_phased_illumina_FILT5.vcf.gz
之类的文件。
如果在 expand
函数中使用此字符串,问题就完全不同了:
expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
此处函数将 {sample}
和 {chrom}
替换为您作为参数 samples
和 chroms
提供的列表中的值。 {{sample}}
正在转换为字符串 "{sample}"
,但这还没有结束。此转换结果 "{sample}"
被视为通配符 {sample}
。例如,考虑规则:
samples = ['12NAE3', '15NOH7']
chroms = ['chr01','chr02']
rule ConcatVCF:
input:
expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
output:
"{{sample}}/{sample}_phased_illumina_FILT5.vcf"
这条规则等同于这条规则,其中 {wildcard}
是实际的通配符(因此名称无关紧要),{{sample}}
是一个字符串 "{sample}"
:
rule ConcatVCF:
input:
"{wildcard}/12NAE3_phased_illumina_FILT5_chr01.vcf",
"{wildcard}/12NAE3_phased_illumina_FILT5_chr02.vcf",
"{wildcard}/15NOH7_phased_illumina_FILT5_chr01.vcf",
"{wildcard}/15NOH7_phased_illumina_FILT5_chr02.vcf"
output:
"{{sample}}/{wildcard}_phased_illumina_FILT5.vcf"
那绝对不是你的意思。删除双括号你使这个规则等同于:
rule ConcatVCF:
input:
"12NAE3/12NAE3_phased_illumina_FILT5_chr01.vcf",
"12NAE3/12NAE3_phased_illumina_FILT5_chr02.vcf",
"15NOH7/15NOH7_phased_illumina_FILT5_chr01.vcf",
"15NOH7/15NOH7_phased_illumina_FILT5_chr02.vcf"
output:
"{wildcard}/{wildcard}_phased_illumina_FILT5.vcf"
根据您的预期命令,这是我重构您的工作流程的尝试。我已发表评论以解释更改。
# moved shell prefix to envmodules directive
# https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules
# that's better for if you need a different set of modules later
reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= [f'chr{chrom:02}' for chrom in range(1, 13)]
# place filenames together or (better) in a config.yaml
phased_output = '{sample}/{sample}_phased_illumina_FILT5.vcf.gz'
variants = '/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf'
sample_bam = '/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam'
temp_vcf = 'temp/{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
split_vcf = '{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
# don't repeat yourself with sample names or phased output
# using expand makes it easier to add new samples later or change your
# filename
rule all:
input:
expand(phased_output, sample=samples)
# using variables for filenames makes rules clearer
# Based on your sample, I think you should pass in wildcards.chrom
# instead of the chroms list
# For shell, I like to format the calls with an option per line
# it makes it easier to see all options and change or remove them.
# snakemake will combine all the lines so note the spaces at the end of
# each line!
rule HaplotypeCalling:
input:
reference = reference,
vcf = variants,
bam = sample_bam,
output:
temp_vcf
envmodules:
'whatshap/1.1-foss-2020b-Python-3.8.6'
shell:
"whatshap polyphase "
"--ploidy 4 "
"-o {output} "
"--sample {wildcards.sample} "
"--chromosome {wildcards.chroms}"
"--reference {input.reference} {input.vcf} {input.bam} "
# options before arguments
rule SplitVCF:
input:
temp_vcf
output:
split_vcf
envmodules:
'BCFtools/1.11-GCC-10.2.0'
shell:
"bcftools view "
"-s {wildcards.sample} "
"-o {output} "
"{input}"
# use output type option of bcf tools to skip the bgzip step
rule ConcatVCF:
input:
expand(split_vcf, chrom=chroms, allow_missing=true)
output:
phased_output
envmodules:
'BCFtools/1.11-GCC-10.2.0'
shell:
"bcftools concat "
"-o {output} "
"-O z " # compressed vcf output
"{input} "
未测试,但这是我第一次通过它!
花一些时间浏览文档中的规则页面并完成本教程。通配符很重要但非常微妙。这是我给的 workshop 的一些 material。它有点过时,但核心 material 仍然不错。