Missing input files after defining them in function
I'm trying to run QC on packaged RNA-seq data. I'm using Snakemake as the workflow manager and know that Snakemake doesn't like one-to-many rules, so I defined a checkpoint, which should solve that. But when I run the script, I get this error message for rule fastqc:
MissingInputException in line 49 of /home/user/2022-h1n1/Snakefile:
Missing input files for rule fastqc:
raw/[]_R1.fastq.gz
Here is my Snakefile; the data is linked in rule download_data. The file is 61.1 GB, so it may not be worth downloading if your system is short on space. Any suggestions would help!
# Snakemake file - input raw reads to generate quant files for analysis in R
configfile: "config.yaml"

import io
import os
import pandas as pd
import pathlib
from snakemake.exceptions import print_exception, WorkflowError

#----SET VARIABLES----#
PROJ = config["proj_name"]
INPUTDIR = config["raw-data"]
SCRATCH = config["scratch"]
REFERENCE = config["ref"]
OUTPUTDIR = config["outputDIR"]

# Adapters
SE_ADAPTER = config['seq']['SE']
SE_SEQUENCE = config['seq']['trueseq-se']

# Organism
TRANSCRIPTOME = config['transcriptome']['human']
SPECIES = config['species']['human']

SAMPLE_LIST = glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz")

rule all:
    input:
        "finished.txt",
        raw_multi_html = SCRATCH + "fastqc/raw_multiqc.html",
        raw_multi_stats = SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
        raw_qc = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.zip", basenames=SAMPLE_LIST),
        trim_qc = expand(SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip", basenames=SAMPLE_LIST)

rule download_data:
    output: "high_quality_files.tgz"
    shell: "curl -L -o {output} https://osf.io/pcxfg/download"

checkpoint decompress_h1n1:
    output: directory(INPUTDIR)
    input: "high_quality_files.tgz"
    params: INPUTDIR
    shell:
        '''
        mkdir -p {params}
        tar xzvf {input} -C {params}
        '''

rule fastqc:
    input: INPUTDIR + "{basenames}_R1.fastq.gz"
    output:
        raw_html = SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        raw_zip = SCRATCH + "fastqc/{basenames}_R1_fastqc.zip"
    wrapper:
        "0.80.3/bio/fastqc"

def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    filenames = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
                       SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
                       basenames=glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
    return filenames

rule download_trimmomatic_adapter_file:
    output: REFERENCE + SE_ADAPTER
    shell:
        """
        curl -L {SE_SEQUENCE} -o {output}
        """

rule trimmomatic_se:
    input:
        reads = INPUTDIR + "{basenames}_R1.fastq.gz",
        adapters = REFERENCE + SE_ADAPTER,
    output:
        reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        unpaired = SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz"
    conda: "rnaseq.yml"
    shell:
        """
        trimmomatic SE {input.reads} \
            {output.reads} {output.unpaired} \
            ILLUMINACLIP:{input.adapters}:2:0:15 LEADING:2 TRAILING:2 \
            SLIDINGWINDOW:4:2 MINLEN:25
        """

rule fastqc_trim:
    input: SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz"
    output:
        html = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        zip = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip"
    params: ""
    log:
        SCRATCH + "logs/fastqc/{basenames}_R1_trimmed.log"
    wrapper:
        "0.35.2/bio/fastqc"

rule multiqc:
    input:
        raw_qc = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.zip", basenames=SAMPLE_LIST),
        trim_qc = expand(SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip", basenames=SAMPLE_LIST)
    output:
        raw_multi_html = SCRATCH + "fastqc/raw_multiqc.html",
        raw_multi_stats = SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
        trim_multi_html = SCRATCH + "fastqc/trimmed_multiqc.html",
        trim_multi_stats = SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
    conda: "env/rnaseq.yml"
    shell:
        """
        multiqc -n multiqc.html {input.raw_qc}  # run multiqc
        mv multiqc.html {output.raw_multi_html}  # rename html
        mv multiqc_data/multiqc_general_stats.txt {output.raw_multi_stats}  # move and rename stats
        rm -rf multiqc_data  # clean up

        # repeat for trimmed data
        multiqc -n multiqc.html {input.trim_qc}  # run multiqc
        mv multiqc.html {output.trim_multi_html}  # rename html
        mv multiqc_data/multiqc_general_stats.txt {output.trim_multi_stats}  # move and rename stats
        rm -rf multiqc_data  # clean up
        """

rule download_transcriptome:
    output: REFERENCE + SPECIES
    shell:
        """
        curl -L {TRANSCRIPTOME} -o {output}
        """

rule salmon_index:
    input:
        ref = REFERENCE + SPECIES
    output: directory(OUTPUTDIR + "quant/sc_ensembl_index")
    conda: "env/rnaseq.yml"
    shell:
        """
        salmon index --index {output} --transcripts {input}  # --type quasi
        """

rule salmon_quant:
    input:
        reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        index_dir = OUTPUTDIR + "quant/sc_ensembl_index"
    output: OUTPUTDIR + "{basenames}_quant/quant.sf"
    params:
        outdir = lambda wildcards: OUTPUTDIR + wildcards.sample + "_quant"
    conda: "env/rnaseq.yml"
    shell:
        """
        salmon quant -i {input.index_dir} --libType A -r {input.reads} -o {params.outdir} --seqBias --gcBias --validateMappings
        """

rule finished:
    input: aggregate_decompress_h1n1
    output: "finished.txt"
    shell:
        '''
        touch {output}
        '''
First, glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz") returns a Wildcards object holding a key:value pair for each wildcard. If you want the basenames themselves, you need glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz").basenames.
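A minimal sketch of that fix; the comment explains where the [] in your error message comes from:

SAMPLE_LIST = glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz").basenames
# Without .basenames, expand() iterates over the Wildcards object itself and
# substitutes its single field (an empty list at parse time, since raw/ does
# not exist yet) literally into the pattern, which is how the broken target
# raw/[]_R1.fastq.gz was built.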
Second, I assume all the fastq.gz files are produced by the decompress_h1n1 checkpoint, since you already collect the fastqc outputs in the aggregate_decompress_h1n1 function. You shouldn't list those outputs again in rule all: that makes Snakemake try to schedule fastqc before the checkpoint has run, and the parse-time expand in rule all then yields the broken fastqc/[]_R1_fastqc.zip target, whose missing input is the raw/[]_R1.fastq.gz in your error.
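A sketch of rule all under that assumption, where everything downstream of the checkpoint is reached through rule finished:

rule all:
    input:
        # rule finished requests aggregate_decompress_h1n1, which is only
        # evaluated after the decompress_h1n1 checkpoint has populated raw/
        "finished.txt"

The multiqc targets would need to be requested the same way, since their inputs also expand over checkpoint-produced files.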
Third, you should also put your trim_qc outputs into the aggregate_decompress_h1n1 function; it's essentially the same problem as with fastqc, and the salmon-related rules most likely need the same treatment.
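A sketch of the extended function. Note that expand() takes a list of patterns as its first argument; in your version the second pattern lands where expand() expects a combinator function (such as zip), so it is never treated as a pattern:

def aggregate_decompress_h1n1(wildcards):
    # Pause DAG evaluation until the checkpoint has run, then glob raw/
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames
    # One list of patterns covering raw and trimmed QC; the salmon quant.sf
    # files could be appended here the same way
    return expand([SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
                   SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
                   SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
                   SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip"],
                  basenames=basenames)

If you do add the salmon outputs, note that salmon_quant's params lambda references wildcards.sample although the rule's wildcard is named basenames; it would need to be wildcards.basenames.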
One more potential issue: you're using a wrapper for fastqc, and as I remember the official fastqc wrapper requires its outputs to be named html and zip, while yours carry a raw_ prefix. I also don't see a 0.80.3 release in the official docs, so I'm not sure whether you're pulling the wrapper from another repository.
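If you stay with the official wrapper, here is a sketch of rule fastqc using the html/zip output names your fastqc_trim rule already uses (the log path is made up, mirroring the one in fastqc_trim):

rule fastqc:
    input: INPUTDIR + "{basenames}_R1.fastq.gz"
    output:
        html = SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        zip = SCRATCH + "fastqc/{basenames}_R1_fastqc.zip"
    params: ""
    log: SCRATCH + "logs/fastqc/{basenames}_R1.log"  # hypothetical path, modeled on fastqc_trim
    wrapper:
        "0.35.2/bio/fastqc"  # the release your fastqc_trim rule already pins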