Snakemake: Create one target output before moving on to the next in a forked workflow

Question

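Suppose you have a workflow with a wildcard (wc in the example below) and you want to run it for a large number of values of that wildcard (say, 1000 samples). Normally I would write a rule all whose input function generates the 1000 file names. However, I found that Snakemake then executes one 1000 times and only afterwards two 1000 times. If two produces a very large intermediate file, that is a problem, because you end up with 1000 large files on disk at the same time.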

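Instead, I would like Snakemake to produce five_1.txt ... five_1000.txt while making sure it actually finishes one output of rule all before moving on to the next. That way temp() deletes one three_{wc}.txt before the next three_{wc}.txt is generated, and the large files never pile up.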
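To make the disk-usage concern concrete, here is a toy model in plain Python. It is not Snakemake's scheduler; it only counts how many three_{wc}.txt temp files exist at once under two execution orders. The function name `peak_temp_files` and the (rule, sample) job tuples are made up for this illustration, and it assumes the temp file is created by rule two and can be deleted as soon as rule four has consumed it.

```python
def peak_temp_files(order):
    """Return the largest number of temp files alive at any point.

    `order` is a list of (rule, sample) jobs in execution order.
    Assumption: rule "two" creates three_{wc}.txt and rule "four"
    is its only consumer, after which temp() can delete it.
    """
    live = set()
    peak = 0
    for rule, sample in order:
        if rule == "two":
            live.add(sample)       # temp file created
        elif rule == "four":
            live.discard(sample)   # temp file no longer needed
        peak = max(peak, len(live))
    return peak


samples = range(1, 1001)
rules = ["one", "two", "three", "four"]

# Rule by rule: every sample passes through a rule before the next rule starts.
rule_by_rule = [(r, s) for r in rules for s in samples]

# Sample by sample: each sample is finished before the next one begins.
sample_by_sample = [(r, s) for s in samples for r in rules]

print(peak_temp_files(rule_by_rule))      # 1000
print(peak_temp_files(sample_by_sample))  # 1
```

The second ordering is the behaviour asked for here: only one large intermediate file exists at a time.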

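In a linear workflow you can use priorities, as @Maarten-vd-Sande suggested. That works because Snakemake looks at the jobs it can run and picks the one with the highest priority, which in a linear workflow is always the one furthest down the chain. In a fork, however, this does not work: the two sides of the fork need the same priority, and Snakemake then simply runs all jobs of one rule first.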

rule one:
  input: "input_{wc}.txt"
  output: 
    touch("one_{wc}.txt"),
    touch("two_{wc}.txt")
  
rule two:
  input: "one_{wc}.txt"
  output: temp(touch("three_{wc}.txt"))

rule three:
  input: "two_{wc}.txt"
  output: touch("four_{wc}.txt")

rule four:
  input:
    "three_{wc}.txt",
    "four_{wc}.txt"
  output: "five_{wc}.txt"
  shell:
    """
    touch {output}
    """

Answer 1

If it is a fork, make it not a fork: two must be executed before three. Then use @Maarten-vd-Sande's solution.

rule one:
  input: "input_{wc}.txt"
  output: 
    touch("one_{wc}.txt"),
    touch("two_{wc}.txt")
  
rule two:
  input: "one_{wc}.txt"
  output: temp(touch("three_{wc}.txt"))
  priority: 1

rule three:
  input: 
    "two_{wc}.txt", 
    "one_{wc}.txt"
  output: touch("four_{wc}.txt")
  priority: 2

rule four:
  input:
    "three_{wc}.txt",
    "four_{wc}.txt"
  output: touch("five_{wc}.txt")
  priority: 3
