Snakemake:如何使用动态创建的文件名处理动态创建的文件?

Snakemake: How to handle dynamically created files with dynamically created file names?

我有一个 multifasta 文件 DE.faa 包含未知数量的蛋白质序列并具有 fasta headers 例如GB012883-PAGB009065-PAGB007275-PA。我已经将 multifasta 文件(使用规则 gather_proteins_with_no_BL62_hits)拆分为文件名为:GB012883-PA.faaGB009065-PA.faaGB007275-PA.faa 的单个 fasta 文件,现在我希望远程对它们中的每一个进行 BLASTp .

作为 BLASTp 的输出,我希望有匹配文件名 GB012883-PA.tsvGB009065-PA.tsvGB007275-PA.tsv 的 tsv 文件和搜索策略文件:GB012883-PA.asnGB009065-PA.asn, GB007275-PA.asn.

这是我的进展:

rule gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.faa"
    output:
        directory("no_BL62_hits/"), 
        touch("fasta_expansion.done")
    script:
        "gather_proteins_with_no_BL62_hits.R"


checkpoint BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}.asn"
    shell:
        r"""
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
        """


def aggregate_input(wildcards):
    with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
        return [seq_name.rstrip() + '.tsv' for seq_name in f]


rule aggregate:
    input:
        aggregate_input
    output:
        touch("BLAST45.done")

但是,当我 运行 收到 InputFunctionException 消息时:WorkflowError: Missing wildcard values for seq_name.

我的问题是:

如何通知Snakemake通配符seq_name指的是序列标识符GB012883-PAGB009065-PAGB007275-PA(并修复WorkflowError) ?

希望你已经自己解决了。如果没有,那么我希望这会有所帮助。我还没有测试过任何一个,所以你可能需要在这里和那里修补它:

checkpoint gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.faa"
    output:
        directory("no_BL62_hits/"), 
        touch("fasta_expansion.done")
    script:
        "gather_proteins_with_no_BL62_hits.R"


rule BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}.asn"
    shell:
        r"""
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
        """


def aggregate_input(wildcards):
    checkpoint_output = checkpoints.gather_proteins_with_no_BL62_hits.get(**wildcards).output[0]
    seq_name, = glob_wildcards(checkpoint_output + "{seq_name}.faa")
    return expand("BL45_headerless/{seq_name}.{ext}", seq_name=seq_name, ext=["tsv", "asn"])


rule aggregate:
    input:
        aggregate_input
    output:
        touch("BLAST45.done")

我更改了哪个规则是检查点。我们可以 know/infer 在 gather_proteins_with_no_BL62_hits 执行之后我们的输出必须是什么,而不是在 BLAST_missing_protein_seqs 执行之后,因为那个已经假设我们知道我们想要什么输出。所以在我们执行 gather_proteins_with_no_BL62_hits 之后我们 运行 aggregate_input。应该可以找到您需要的所有 seq_names