Is the use of * allowed in Snakemake?
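I need to concatenate certain files in certain directories, and those directories are created with wildcards in the Snakefile. I tried to write the following rule to concatenate all the files in each of those directories: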
# concatenate output per hmm
rule concatenate:
    input:
        output_{hmm}/* ,
    output:
        output_{hmm}/cat_{hmm}.txt,
    params:
        cmd='cat'
    shell:
        '{params.cmd} {input} > {output} '
It did not work and produced the following error:
"SyntaxError in line 62 of /scratch/data1/agalvez/domains/Snakefile_ecdf:
invalid syntax (Snakefile_ecdf, line 62)"
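I don't know what is wrong with the rule. I suspect that using * on its own is not enough, but I can't think of another way to do what I want.

Edit: The question may be missing some information, so I am also attaching the complete Snakefile: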
ARCHIVE_FILE = 'output.tar.gz'

# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'

# a single input file
INPUT_FILE = 'proteins/{species}.fasta'

# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'

# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'

# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species

# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm

# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)

# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE ,
        hmm=HMM_FILE
    output:
        OUTPUT_FILE,
    params:
        cmd='hmmsearch --noali -E 99 --tblout'
    shell:
        '{params.cmd} {output} {input.hmm} {input.species} '

# concatenate output per hmm
from glob import glob
rule concatenate:
    input:
        files = glob("output_{hmm}/*") ,
    output:
        CAT_FILE,
    params:
        cmd='cat'
    shell:
        '{params.cmd} {input.files} {output} '

# create an archive with all results
rule create_archive:
    input: OUT, CAT,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
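A Snakefile is Python code, so every file reference has to be a string or a path-like object (or a variable or function call that returns one). In general, though, input files should be specific files, not directories.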
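To make the SyntaxError from the question concrete, here is a small sketch in plain Python, run outside Snakemake. It assumes that the value written under input: is parsed like an ordinary Python expression; the helper name is_python_expression is made up for this illustration.

```python
import ast


def is_python_expression(source: str) -> bool:
    """Return True if `source` parses as a Python expression."""
    try:
        ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    return True


# The input line from the question, without quotes.
unquoted = "output_{hmm}/* ,"
# The same path written as a string literal.
quoted = '"output_{hmm}/*" ,'

print(is_python_expression(unquoted))  # False: not valid Python
print(is_python_expression(quoted))    # True: a tuple holding one string
```

Quoting only removes the syntax error. As far as I know, Snakemake then treats the string as a literal file name containing *, so the files still need to be listed some other way.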
There are several ways to solve this. One of them is to glob the files in Python and pass them in explicitly:
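from glob import glob

rule concatenate:
    input:
        # Resolve the wildcard first: a plain glob("output_{hmm}/*") would look
        # for a directory literally named "output_{hmm}".
        files=lambda wildcards: glob(f"output_{wildcards.hmm}/*"),
    output:
        combined="output_{hmm}/cat_{hmm}.txt",
    params:
        cmd='cat'
    shell:
        '{params.cmd} {input.files} > {output.combined}'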
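Note that, as written, this rule will cause problems on repeated runs: the concatenated file ("output_{hmm}/cat_{hmm}.txt") lives in the globbed directory, so on a later run it matches the glob and becomes one of its own inputs.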
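The repeated-run problem can be reproduced without Snakemake. The sketch below uses only the standard library and a temporary directory; the helper list_inputs and the file names are illustrative assumptions that follow the naming scheme from the question.

```python
import tempfile
from glob import glob
from pathlib import Path


def list_inputs(directory, pattern):
    """Return the sorted file names in `directory` that match `pattern`."""
    return sorted(Path(p).name for p in glob(str(Path(directory) / pattern)))


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output_A"
    out_dir.mkdir()
    # Stand-ins for the per-species results written by rule hmm.
    for species in ("1", "2"):
        (out_dir / f"{species}_A.out").touch()

    # First run: only the per-species results are present.
    first_run = list_inputs(out_dir, "*")

    # The concatenated file is written into the same directory ...
    (out_dir / "cat_A.txt").touch()

    # ... so a second run with the same pattern picks it up as an input.
    second_run = list_inputs(out_dir, "*")

    # A narrower pattern leaves the concatenated file out.
    filtered = list_inputs(out_dir, "*.out")

print(first_run)
print(second_run)
print(filtered)
```

Narrowing the pattern to "*.out", or writing the concatenated file to a separate directory as CAT_FILE does in the full Snakefile, avoids the self-reference.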
For testing purposes, let's create a sample set of files (dummy files are enough to check that the workflow runs correctly). In the terminal, I ran:
mkdir proteins && touch proteins/1.fasta proteins/2.fasta
mkdir hmm && touch hmm/A.hmm hmm/B.hmm
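Now, apart from the rule concatenate, your workflow is basically correct. The inputs of this rule are created by the rule hmm, and its output is specific to one value of the hmm wildcard. So for a given hmm you want the files for all values of species. The way to get that is to use expand while leaving hmm as a wildcard, by calling expand(OUTPUT_FILE, species=INP, hmm="{hmm}") in the rule's input.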
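Why passing hmm="{hmm}" keeps the wildcard can be shown with plain string formatting. This is a minimal sketch, not Snakemake's actual expand: partial_expand is a hypothetical stand-in that fills the placeholders with every combination of the given values, and the species list is hard-coded to match the dummy files created above.

```python
from itertools import product


def partial_expand(pattern, **values):
    """Hypothetical stand-in for expand(): format `pattern` with every
    combination of the given values (a single string counts as one value)."""
    keys = list(values)
    choices = [[v] if isinstance(v, str) else list(v) for v in values.values()]
    return [pattern.format(**dict(zip(keys, combo))) for combo in product(*choices)]


OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"
INP = ["1", "2"]  # species found for proteins/1.fasta and proteins/2.fasta

# Substituting the text "{hmm}" for the hmm placeholder leaves it in place.
inputs = partial_expand(OUTPUT_FILE, species=INP, hmm="{hmm}")
print(inputs)  # ['output_{hmm}/1_{hmm}.out', 'output_{hmm}/2_{hmm}.out']

# Later, for one concrete job (here hmm = "A"), the wildcard is filled in.
resolved = [path.format(hmm="A") for path in inputs]
print(resolved)  # ['output_A/1_A.out', 'output_A/2_A.out']
```

With that in place, the rule reads: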
rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "
In the workflow below I modified the rule hmm so that a test run finishes quickly. The complete workflow then looks like this:
ARCHIVE_FILE = "output.tar.gz"

# a single output file
OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"

# a single input file
INPUT_FILE = "proteins/{species}.fasta"

# a single hmm file
HMM_FILE = "hmm/{hmm}.hmm"

# a single cat file
CAT_FILE = "cat/cat_{hmm}.txt"

# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species

# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm

# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)

# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input:
        ARCHIVE_FILE,

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE,
        hmm=HMM_FILE,
    output:
        touch(OUTPUT_FILE),
    params:
        # cmd='hmmsearch --noali -E 99 --tblout'
        cmd="echo",
    shell:
        "{params.cmd} {output} {input.hmm} {input.species} "

rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "

# create an archive with all results
rule create_archive:
    input:
        CAT,
    output:
        ARCHIVE_FILE,
    shell:
        "tar -czvf {output} {input}"