Limiting the number (or size) of outstanding intermediate outputs to be processed

Question

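I have a workflow in which earlier, faster steps produce large intermediate outputs that are then consumed by slower, later steps. For example, imagine a workflow that decompresses gzip files and recompresses them with bzip2.

Here is an untested example to illustrate my problem: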

rule decompress:
  input:
    "gz/{dataset}.dat.gz"
  output:
    temp("decompress/{dataset}.dat")
  shell:
    "gunzip -c {input} > {output}"

rule compress:
  input:
    "decompress/{dataset}.dat"
  output:
    "bzip/{dataset}.dat.bz2"
  shell:
    "bzip2 -c {input} > {output}"

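My problem is that, because the decompress step runs faster than the second step, compress, it tends to fill up my disk space with uncompressed files. Is there a way to limit the number (or size) of intermediate datasets waiting to be processed by the later, slower rule?

Cheers.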

Answer 1

You can specify arbitrary resources, for example:

rule decompress:
  input:
    "gz/{dataset}.dat.gz"
  output:
    temp("decompress/{dataset}.dat")
  resources:
    # limit_space is an arbitrary resource name; the total
    # available is set on the command line, which caps how
    # many jobs using it can run at the same time
    limit_space=1,
  shell:
    "gunzip -c {input} > {output}"

rule compress:
  input:
    "decompress/{dataset}.dat"
  output:
    "bzip/{dataset}.dat.bz2"
  resources:
    # note that compress/decompress are now
    # competing for the limited resource
    limit_space=1,
  priority:
    # so specifying a higher priority for this
    # rule will make sure that it runs before
    # decompress rule
    100,
  shell:
    "bzip2 -c {input} > {output}"

To specify the resource constraint, use the CLI:

snakemake --resources limit_space=6

There are more details in the docs on resources.

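Update: as pointed out in the comments, the command above will initially launch 6 instances of the decompress rule and then, as they finish, start new instances of compress one by one. At most 6 decompress instances run at a time (if several decompressed files are waiting, the next resource slot that is freed will be taken by the compress rule).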
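To make the behaviour described in the update easier to reason about, here is a toy simulation in plain Python. It is only a sketch under simplifying assumptions and not Snakemake's actual scheduler: every job holds exactly one unit of limit_space, job durations are fixed, cores are ignored, and a freed slot is handed out immediately. The function name simulate and all of its parameters are made up for this illustration.

```python
import heapq


def simulate(n_datasets, budget, decompress_time=1, compress_time=3,
             prefer_compress=True):
    """Toy scheduler: every running job holds one unit of a shared resource.

    Returns (peak number of uncompressed files on disk, rules in the first wave).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    to_decompress = list(range(n_datasets))  # datasets not yet started
    to_compress = []                         # decompressed, waiting for a slot
    running = []                             # heap of (finish_time, rule, dataset)
    now = 0
    on_disk = peak = done = 0
    first_wave = None

    while done < n_datasets:
        # Fill every free slot; the preferred rule gets the slot first.
        while len(running) < budget and (to_compress or to_decompress):
            pick_compress = bool(to_compress) and (prefer_compress or not to_decompress)
            if pick_compress:
                job = (now + compress_time, "compress", to_compress.pop(0))
            else:
                job = (now + decompress_time, "decompress", to_decompress.pop(0))
                on_disk += 1  # the uncompressed file exists from this point on
                peak = max(peak, on_disk)
            heapq.heappush(running, job)

        if first_wave is None:
            first_wave = sorted(rule for _, rule, _ in running)

        # Jump to the next job that finishes and release its slot.
        now, rule, dataset = heapq.heappop(running)
        if rule == "decompress":
            to_compress.append(dataset)
        else:
            on_disk -= 1  # temp() output is deleted once compress is done
            done += 1

    return peak, first_wave


if __name__ == "__main__":
    print(simulate(20, 6))                         # compress has priority
    print(simulate(20, 6, prefer_compress=False))  # decompress runs first
```

In this toy model, 20 datasets with a budget of 6 start with a first wave of six decompress jobs and never have more than 6 uncompressed files on disk when compress has priority. With prefer_compress=False all 20 files end up on disk at once, which is why the priority setting matters. A real run may behave differently, since Snakemake also weighs cores and other resources when it picks jobs.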

Also, a dedicated resource can be useful when other rules are competing for cores: it isolates the constraint that is specific to decompress/compress.
