Array of values as input in Snakemake workflows

Question

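I have started migrating my workflows from Nextflow to Snakemake and am already hitting a wall at the very start of my pipelines, which usually begin with a list of numbers (representing "run numbers" from our detector).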

For example, I have a run-list.txt like

# detector_id run_number
75 63433
75 67325
42 57584
42 57899
42 58998

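which then needs to be passed line by line to a process that queries a database or a data storage system and retrieves a file to the local system.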

This means that, for example, 75 63433 would produce the output RUN_00000075_00063433.h5 via a rule that receives detector_id=75 and run_number=63433 as input parameters.

With Nextflow this is fairly easy: just define a process that emits tuples of these values.

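I don't quite understand how to do something like this in Snakemake, since inputs and outputs always seem to need to be files (remote or local). In fact, some of the files are indeed accessible via iRODS and/or XRootD, but even then I need to start with a run selection first, which is defined in a list like the run-list.txt above.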

My question is now: what is the Snakemake-style approach to this problem?

A non-working pseudocode would be:

rule:
    input:
        [line for line in open("run-list.txt").readlines()]
    output:
        "{detector_id}_{run_number}.h5"
    shell:
        "detector_id, run_number = line.split()"
        "touch "{detector_id}_{run_number}.h5""

Answer 1

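To make this work you need two ingredients:

A rule that specifies the logic for generating a single file (defining any file dependencies if necessary).
A rule that defines which files should be computed; by convention, this rule is called all.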

Here is a rough sketch of the code:

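def process_lines(file_name):
    """Yield zero-padded (detector_id, run_number) pairs, skipping non-numeric lines."""
    with open(file_name, "r") as f:
        for line in f:
            fields = line.split()
            # Blank or short lines would break the unpacking below, so skip them.
            if len(fields) < 2:
                continue
            detector_id, run_number = fields[:2]
            # The "# detector_id run_number" header fails this check and is ignored.
            if detector_id.isnumeric() and run_number.isnumeric():
                yield detector_id.zfill(8), run_number.zfill(8)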


out_file_format = "{detector_id}_{run_number}.h5"
final_files = [
    out_file_format.format(detector_id=detector_id, run_number=run_number)
    for detector_id, run_number in process_lines("run-list.txt")
]


rule all:
    """Collect all outputs."""
    input:
        final_files,


rule:
    """Generate an output"""
    output:
        out_file_format,
    shell:
        """
        echo {wildcards[detector_id]}
        echo {wildcards[run_number]}
        echo {output}
        """

Answer 2

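In Snakemake, you would use this file to generate the list of values to feed into the workflow. You would parse the detector IDs and run numbers outside of any rule. If you are willing to use an external library, your run list looks like it can be handled neatly with pandas.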

import pandas as pd

run_list = pd.read_csv("run-list.txt", header=0, names=["detector_id", "run_number"], sep=" ")
detector_ids = list(run_list["detector_id"])
run_numbers = list(run_list["run_number"])

Then comes the rule that runs when you want to obtain a single file, assuming your file names do not need to be zero-padded:

rule do_something:
    output: "{detector_id}_{run_number}.h5"
    shell: "do_something_with {wildcards.detector_id} {wildcards.run_number}"

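With this rule alone, detector_id and run_number could theoretically be anything, so you need something that tells Snakemake which outputs you want it to produce. To run it for all lines in the file, set up a rule that takes every potential output defined by the file as its input.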

rule run_all:
    input: expand("{detector_id}_{run_number}.h5", zip, detector_id=detector_ids, run_number=run_numbers)

The zip argument ensures that the first detector ID is paired with the first run number, and so on.
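To illustrate what the zip argument changes, here is a plain-Python sketch of the two pairing behaviours. It does not call Snakemake at all: the helper names expand_zip and expand_product are made up for this illustration, and the real expand function has more features than shown here.

```python
from itertools import product

pattern = "{detector_id}_{run_number}.h5"
# Two rows taken from the run list in the question.
detector_ids = [75, 42]
run_numbers = [67325, 57584]


def expand_zip(pattern, detector_ids, run_numbers):
    """Pair the i-th detector ID with the i-th run number."""
    return [
        pattern.format(detector_id=d, run_number=r)
        for d, r in zip(detector_ids, run_numbers)
    ]


def expand_product(pattern, detector_ids, run_numbers):
    """Combine every detector ID with every run number."""
    return [
        pattern.format(detector_id=d, run_number=r)
        for d, r in product(detector_ids, run_numbers)
    ]


zipped = expand_zip(pattern, detector_ids, run_numbers)
crossed = expand_product(pattern, detector_ids, run_numbers)
print(zipped)   # only the pairs that appear in the run list
print(crossed)  # also contains pairs such as 75_57584.h5 that were never listed
```

Without zip, expand combines all values with each other, so the workflow would request runs that do not exist for a given detector.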

Finally, you run it by specifying the name of the rule you want to run, so snakemake run_all.
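As a rough illustration of why a rule such as do_something can match any file name of the right shape, the sketch below imitates wildcard matching with a regular expression. It is a simplification and not Snakemake's actual code: it assumes every wildcard matches the regex .+ (the default Snakemake documents for unconstrained wildcards), and the function name match_output is hypothetical.

```python
import re


def match_output(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    regex = ""
    pos = 0
    # Turn every {name} into a named group and escape the literal text around it.
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])
        regex += f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


pattern = "{detector_id}_{run_number}.h5"
print(match_output(pattern, "75_63433.h5"))   # a run from the list
print(match_output(pattern, "foo_bar.h5"))    # also matches, with non-numeric values
print(match_output(pattern, "75_63433.txt"))  # wrong suffix, so no match
```

Because almost anything matches, the target rule run_all is what pins the workflow to the runs in the list. If stricter matching is wanted, Snakemake's wildcard_constraints directive can restrict a wildcard to, for example, digits only.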

Answer 3

There are already some good answers, but since I had put the code together in the meantime, here is my 2p. Save it as Snakefile and it should run.

import pandas

# In reality you read this from file using pandas.read_csv.
# Or use a solution other than pandas dataframes.
run_list = [
    (75, 63433),
    (75, 67325),
    (42, 57584),
    (42, 57899),
    (42, 58998),
]

run_list = pandas.DataFrame(run_list, columns=['detector_id', 'run_id'])

rule all:
    input:
        expand('RUN_{detector_id}_{run_id}.h5', zip, detector_id= run_list.detector_id, run_id= run_list.run_id),

rule make_run:
    output:
        'RUN_{detector_id}_{run_id}.h5',
    shell:
        r"""
        touch {output}
        """

You will need some string manipulation for the zero-padding, but that is a Python matter rather than a Snakemake one.
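A minimal sketch of that string manipulation, using the RUN_00000075_00063433.h5 naming from the question. The helper name run_file_name and the fixed width of 8 digits are assumptions read off that single example file name.

```python
def run_file_name(detector_id, run_number, width=8):
    """Build a zero-padded file name such as RUN_00000075_00063433.h5."""
    # int() accepts both integers and numeric strings read from a text file.
    return f"RUN_{int(detector_id):0{width}d}_{int(run_number):0{width}d}.h5"


run_list = [(75, 63433), (75, 67325), (42, 57584), (42, 57899), (42, 58998)]
final_files = [run_file_name(d, r) for d, r in run_list]
print(final_files[0])  # RUN_00000075_00063433.h5
```

A list like final_files can be passed directly as the input of rule all; the wildcards of make_run then receive the already padded strings.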


Tags: python, directed-acyclic-graphs, pandas, snakemake, nextflow