Is the use of * allowed in Snakemake?

Question

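I need to concatenate certain files located in directories that are created in my Snakefile using wildcards. I tried to write the following rule to concatenate all the files in those directories: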

# concatenate output per hmm
rule concatenate:
    input:
        output_{hmm}/* ,
    output: 
        output_{hmm}/cat_{hmm}.txt,
    params:
        cmd='cat'
    shell: 
        '{params.cmd} {input} > {output} '

It does not work and produces the following error:

"SyntaxError in line 62 of /scratch/data1/agalvez/domains/Snakefile_ecdf:
invalid syntax (Snakefile_ecdf, line 62)"

I don't know what is wrong with the rule. I suspect that using * is not enough here, but I can't think of another way to do what I want.

EDIT: The question may have been missing some information, so I am also attaching the full Snakefile:

ARCHIVE_FILE = 'output.tar.gz'

# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'

# a single input file
INPUT_FILE = 'proteins/{species}.fasta'

# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'

# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'

# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species

# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm

# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)

# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE ,
        hmm=HMM_FILE
    output: 
        OUTPUT_FILE,
    params:
        cmd='hmmsearch --noali -E 99 --tblout'
    shell: 
        '{params.cmd} {output} {input.hmm} {input.species} '

# concatenate output per hmm
from glob import glob

rule concatenate:
    input:
        files = glob("output_{hmm}/*") ,
    output: 
        CAT_FILE,
    params:
        cmd='cat'
    shell: 
        '{params.cmd} {input.files} {output} '

# create an archive with all results
rule create_archive:
    input: OUT, CAT,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'

Answer 1

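A Snakefile is Python code, so every file reference must be a string or path-like object (or a variable or function that returns such an object). The unquoted output_{hmm}/* in your rule is not valid Python, which is what triggers the SyntaxError. In general, input files should also be specific files, not directories.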
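
To see why the original rule fails before Snakemake even looks at the filesystem, here is a minimal plain-Python sketch. It assumes only that the text after input: has to parse as a Python expression; the helper name is_valid_python is made up for illustration and is not part of Snakemake.

```python
import ast


def is_valid_python(expression: str) -> bool:
    """Return True if the text parses as a Python expression (hypothetical helper)."""
    try:
        ast.parse(expression, mode="eval")
        return True
    except SyntaxError:
        return False


# Unquoted path, as written in the question: not a Python expression.
unquoted = "output_{hmm}/*"
# Quoted path: an ordinary string literal.
quoted = '"output_{hmm}/*"'

print(is_valid_python(unquoted))  # False
print(is_valid_python(quoted))    # True
```

Quoting removes the syntax error, but as far as I know Snakemake does not expand a shell-style * inside an input string, so the files still have to be listed some other way.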

There are several ways to solve this. One of them is to glob the files in Python and pass them in explicitly:

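from glob import glob

rule concatenate:
    input:
        # Use an input function so {hmm} is resolved per job; a bare
        # glob("output_{hmm}/*") would run at parse time with a literal "{hmm}".
        files = lambda wildcards: glob(f"output_{wildcards.hmm}/*"),
    output:
        combined = "output_{hmm}/cat_{hmm}.txt",
    params:
        cmd='cat'
    shell:
        # Redirect cat's output into the combined file.
        '{params.cmd} {input.files} > {output.combined}'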

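Note that, as written, this rule will cause problems on repeated runs, because the concatenated file ("output_{hmm}/cat_{hmm}.txt") lives in the same directory and will itself be matched by the glob the next time the workflow runs.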
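
The repeated-run problem can be reproduced outside Snakemake. The sketch below is only an illustration with made-up file names in a temporary directory: it shows that a bare * also picks up the concatenated file once it exists, while a narrower pattern such as *.out (an assumption based on the .out suffix in the question's OUTPUT_FILE) does not.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Simulate the directory after one full run for hmm "A".
    out_dir = Path(tmp) / "output_A"
    out_dir.mkdir()
    for name in ("1_A.out", "2_A.out", "cat_A.txt"):
        (out_dir / name).touch()

    # A bare "*" also matches the concatenated file from the previous run.
    everything = sorted(Path(p).name for p in glob(str(out_dir / "*")))
    # Restricting the pattern to "*.out" leaves it out.
    only_out = sorted(Path(p).name for p in glob(str(out_dir / "*.out")))

print(everything)  # ['1_A.out', '2_A.out', 'cat_A.txt']
print(only_out)    # ['1_A.out', '2_A.out']
```

Writing the concatenated file to a separate directory, as the CAT_FILE pattern in the edited question does, avoids the overlap in the same way.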

Answer 2

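For testing purposes, let's create a set of example files (that is, work with dummy files to make sure the workflow runs correctly). In a terminal, I ran: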

mkdir proteins && touch proteins/1.fasta proteins/2.fasta
mkdir hmm && touch hmm/A.hmm hmm/B.hmm

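Now, your workflow is essentially correct except for the rule concatenate. The input of this rule is created by the rule hmm, and the output of this rule is specific to the value of the hmm wildcard. So, for a given hmm, you want all values of species. The way to get that is to use expand while keeping hmm as a wildcard, with expand(OUTPUT_FILE, species=INP, hmm="{hmm}"):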

rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "

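To make the effect of hmm="{hmm}" concrete, here is a rough plain-Python sketch. expand_sketch is a hypothetical stand-in written for this illustration, not Snakemake's actual expand implementation, and the later str.format call only imitates how Snakemake fills in the wildcard for each job. The species values match the dummy files created above.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for expand(): fill the pattern with every
    combination of the given values."""
    # Accept a single string as well as a list of values.
    values = {k: [v] if isinstance(v, str) else list(v) for k, v in wildcards.items()}
    keys = list(values)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*values.values())
    ]


OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"
INP = ["1", "2"]

# species is filled in, while hmm is replaced by the literal text "{hmm}",
# so it is still a wildcard in the resulting patterns.
inputs = expand_sketch(OUTPUT_FILE, species=INP, hmm="{hmm}")
print(inputs)  # ['output_{hmm}/1_{hmm}.out', 'output_{hmm}/2_{hmm}.out']

# Imitate what happens when a job for hmm = "A" is scheduled.
resolved = [p.format(hmm="A") for p in inputs]
print(resolved)  # ['output_A/1_A.out', 'output_A/2_A.out']
```

The rule therefore depends on exactly the files that rule hmm produces for that one hmm, regardless of what else happens to be in the directory.
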
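In the workflow below I modified the rule hmm to allow a quick test run (it echoes its arguments instead of calling hmmsearch, and touch() creates the empty output files), so the full workflow looks like this: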

ARCHIVE_FILE = "output.tar.gz"

# a single output file
OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"

# a single input file
INPUT_FILE = "proteins/{species}.fasta"

# a single hmm file
HMM_FILE = "hmm/{hmm}.hmm"

# a single cat file
CAT_FILE = "cat/cat_{hmm}.txt"

# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species

# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm

# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)

# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)


# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input:
        ARCHIVE_FILE,


# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE,
        hmm=HMM_FILE,
    output:
        touch(OUTPUT_FILE),
    params:
        #        cmd='hmmsearch --noali -E 99 --tblout'
        cmd="echo",
    shell:
        "{params.cmd} {output} {input.hmm} {input.species} "


rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "


# create an archive with all results
rule create_archive:
    input:
        CAT,
    output:
        ARCHIVE_FILE,
    shell:
        "tar -czvf {output} {input}"
