Weird NameError with python function in Snakemake script
This is an extension of a question I asked earlier. I've searched all over Stack Overflow, but I couldn't find another instance of this particular NameError:
Building DAG of jobs...
Updating job done.
InputFunctionException in line 148 of /home/nasiegel/2022-h1n1/Snakefile:
Error:
NameError: free variable 'combinator' referenced before assignment in enclosing scope
Wildcards:
Traceback:
File "/home/nasiegel/2022-h1n1/Snakefile", line 131, in aggregate_decompress_h1n1
I assumed it was an issue with the symbolic file paths in my function:
def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    filenames = expand(
        SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
        SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
        OUTPUTDIR + "{basenames}_quant/quant.sf",
        basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
    return filenames
However, hard-coding the paths does not fix the problem. I've attached the full Snakefile below; any suggestions would be much appreciated.
Original file
# Snakemake file - input raw reads to generate quant files for analysis in R
configfile: "config.yaml"

import io
import os
import pandas as pd
import pathlib
from snakemake.exceptions import print_exception, WorkflowError

#----SET VARIABLES----#
PROJ = config["proj_name"]
INPUTDIR = config["raw-data"]
SCRATCH = config["scratch"]
REFERENCE = config["ref"]
OUTPUTDIR = config["outputDIR"]

# Adapters
SE_ADAPTER = config['seq']['SE']
SE_SEQUENCE = config['seq']['trueseq-se']

# Organism
TRANSCRIPTOME = config['transcriptome']['rhesus']
SPECIES = config['species']['rhesus']

SAMPLE_LIST = glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz").basenames

rule all:
    input:
        "final.txt",
        # download reference files
        REFERENCE + SE_ADAPTER,
        REFERENCE + SPECIES,
        # multiqc
        SCRATCH + "fastqc/raw_multiqc.html",
        SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
        SCRATCH + "fastqc/trimmed_multiqc.html",
        SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"

rule download_trimmomatic_adapter_file:
    output: REFERENCE + SE_ADAPTER
    shell: "curl -L -o {output} {SE_SEQUENCE}"

rule download_transcriptome:
    output: REFERENCE + SPECIES
    shell: "curl -L -o {output} {TRANSCRIPTOME}"

rule download_data:
    output: "high_quality_files.tgz"
    shell: "curl -L -o {output} https://osf.io/pcxfg/download"

checkpoint decompress_h1n1:
    output: directory(INPUTDIR)
    input: "high_quality_files.tgz"
    shell: "tar xzvf {input}"

rule fastqc:
    input: INPUTDIR + "{basenames}_R1.fastq.gz"
    output:
        raw_html = SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        raw_zip = SCRATCH + "fastqc/{basenames}_R1_fastqc.zip"
    conda: "env/rnaseq.yml"
    wrapper: "0.80.3/bio/fastqc"

rule multiqc:
    input:
        raw_qc = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.zip", basenames=SAMPLE_LIST),
        trim_qc = expand(SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip", basenames=SAMPLE_LIST)
    output:
        raw_multi_html = SCRATCH + "fastqc/raw_multiqc.html",
        raw_multi_stats = SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
        trim_multi_html = SCRATCH + "fastqc/trimmed_multiqc.html",
        trim_multi_stats = SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
    conda: "env/rnaseq.yml"
    shell:
        """
        multiqc -n multiqc.html {input.raw_qc} #run multiqc
        mv multiqc.html {output.raw_multi_html} #rename html
        mv multiqc_data/multiqc_general_stats.txt {output.raw_multi_stats} #move and rename stats
        rm -rf multiqc_data #clean-up
        #repeat for trimmed data
        multiqc -n multiqc.html {input.trim_qc} #run multiqc
        mv multiqc.html {output.trim_multi_html} #rename html
        mv multiqc_data/multiqc_general_stats.txt {output.trim_multi_stats} #move and rename stats
        rm -rf multiqc_data #clean-up
        """

rule trimmmomatic_se:
    input:
        reads = INPUTDIR + "{basenames}_R1.fastq.gz",
        adapters = REFERENCE + SE_ADAPTER,
    output:
        reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        unpaired = SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz"
    conda: "env/rnaseq.yml"
    log: SCRATCH + "logs/fastqc/{basenames}_R1_trim_unpaired.log"
    shell:
        """
        trimmomatic SE {input.reads} \
        {output.reads} {output.unpaired} \
        ILLUMINACLIP:{input.adapters}:2:0:15 LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 MINLEN:25
        """

rule fastqc_trim:
    input: SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz"
    output:
        html = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        zip = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip"
    log: SCRATCH + "logs/fastqc/{basenames}_R1_trimmed.log"
    conda: "env/rnaseq.yml"
    wrapper: "0.35.2/bio/fastqc"

rule salmon_quant:
    input:
        reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        index_dir = OUTPUTDIR + "quant/sc_ensembl_index"
    output: OUTPUTDIR + "{basenames}_quant/quant.sf"
    params: OUTPUTDIR + "{basenames}_quant"
    log: SCRATCH + "logs/salmon/{basenames}_quant.log"
    conda: "env/rnaseq.yml"
    shell:
        """
        salmon quant -i {input.index_dir} --libType A -r {input.reads} -o {params} --seqBias --gcBias --validateMappings
        """

def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    filenames = expand(
        SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
        SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
        OUTPUTDIR + "{basenames}_quant/quant.sf",
        basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
    return filenames

rule salmon_index:
    input: REFERENCE + SPECIES
    output: directory(OUTPUTDIR + "quant/sc_ensembl_index")
    conda: "env/rnaseq.yml"
    shell: "salmon index --index {output} --transcripts {input} # --type quasi"

rule done:
    input: aggregate_decompress_h1n1
    output: "final.txt"
    shell: "touch {output}"
I suspect it's because you are misusing the expand function: expand accepts only two positional arguments, the first being the pattern (or list of patterns) and the second an optional combinator function. If you want to supply multiple patterns, you should wrap them in a list.

After digging into snakemake's source code, it turns out that expand does not check whether the caller passed fewer than three positional arguments. Inside it there is an if/elif block in which a variable named combinator is only assigned when there are exactly one or two positional arguments; the many positional arguments you supplied skip that block entirely, which causes the error later when combinator is used.
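A condensed paraphrase of the relevant logic in expand (based on the v6.5.4 snakemake/io.py linked below, not a verbatim copy) shows why the extra positional arguments leave combinator unbound:

def expand(*args, **wildcards):
    filepatterns = args[0]
    if len(args) == 1:
        combinator = product   # default: itertools.product over the wildcard values
    elif len(args) == 2:
        combinator = args[1]   # caller-supplied combinator, e.g. zip
    # With three or more positional arguments neither branch runs, so
    # `combinator` is never bound; the nested helper that later builds the
    # expanded file list closes over it and Python raises
    # "free variable 'combinator' referenced before assignment in enclosing scope".
    ...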
Source code: https://snakemake.readthedocs.io/en/v6.5.4/_modules/snakemake/io.html
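Applied to the input function from the question, a minimal sketch of the fix is simply to pass the patterns as a single list (the pattern strings themselves are unchanged from the original):

def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    # Collect the sample basenames produced by the checkpoint.
    basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames
    # One positional argument (a list of patterns) plus keyword wildcards,
    # as expand() expects; the default combinator is then defined and used.
    filenames = expand(
        [
            SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
            SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
            SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
            SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
            SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
            SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
            OUTPUTDIR + "{basenames}_quant/quant.sf",
        ],
        basenames=basenames,
    )
    return filenames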