Snakemake：如何使用动态创建的文件名处理动态创建的文件？

Question

我有一个 multifasta 文件 DE.faa 包含未知数量的蛋白质序列并具有 fasta headers 例如GB012883-PA、GB009065-PA、GB007275-PA。我已经将 multifasta 文件（使用规则 gather_proteins_with_no_BL62_hits）拆分为文件名为：GB012883-PA.faa、GB009065-PA.faa、GB007275-PA.faa 的单个 fasta 文件，现在我希望远程对它们中的每一个进行 BLASTp .

作为 BLASTp 的输出，我希望有匹配文件名 GB012883-PA.tsv、GB009065-PA.tsv、GB007275-PA.tsv 的 tsv 文件和搜索策略文件：GB012883-PA.asn、GB009065-PA.asn, GB007275-PA.asn.

这是我的进展：

rule gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.faa"
    output:
        directory("no_BL62_hits/"), 
        touch("fasta_expansion.done")
    script:
        "gather_proteins_with_no_BL62_hits.R"


checkpoint BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}.asn"
    shell:
        r"""
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
        """


def aggregate_input(wildcards):
    with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
        return [seq_name.rstrip() + '.tsv' for seq_name in f]


rule aggregate:
    input:
        aggregate_input
    output:
        touch("BLAST45.done")

但是，当我运行收到 InputFunctionException 消息时：WorkflowError: Missing wildcard values for seq_name.

我的问题是：

如何通知Snakemake通配符seq_name指的是序列标识符GB012883-PA、GB009065-PA、GB007275-PA（并修复WorkflowError） ?

Answer 1

希望你已经自己解决了。如果没有，那么我希望这会有所帮助。我还没有测试过任何一个，所以你可能需要在这里和那里修补它：

checkpoint gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.faa"
    output:
        directory("no_BL62_hits/"), 
        touch("fasta_expansion.done")
    script:
        "gather_proteins_with_no_BL62_hits.R"


rule BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}.asn"
    shell:
        r"""
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
        """


def aggregate_input(wildcards):
    checkpoint_output = checkpoints.gather_proteins_with_no_BL62_hits.get(**wildcards).output[0]
    seq_name, = glob_wildcards(checkpoint_output + "{seq_name}.faa")
    return expand("BL45_headerless/{seq_name}.{ext}", seq_name=seq_name, ext=["tsv", "asn"])


rule aggregate:
    input:
        aggregate_input
    output:
        touch("BLAST45.done")

我更改了哪个规则是检查点。我们可以 know/infer 在 gather_proteins_with_no_BL62_hits 执行之后我们的输出必须是什么，而不是在 BLAST_missing_protein_seqs 执行之后，因为那个已经假设我们知道我们想要什么输出。所以在我们执行 gather_proteins_with_no_BL62_hits 之后我们运行 aggregate_input。应该可以找到您需要的所有 seq_names。

Snakemake：如何使用动态创建的文件名处理动态创建的文件？

Snakemake: How to handle dynamically created files with dynamically created file names?

bioinformatics

snakemake