How to use a wildcard within expand function parameters in snakemake?

I have a JSON file that looks like this:

{
    "foo": {
        "bar1": 
            {"A1": {"name": "A1", "path": "/path/to/A1"}, 
             "B1": {"name": "B1", "path": "/path/to/B1"},
             "C1": {"name": "C1", "path": "/path/to/C1"},
             "D1": {"name": "D1", "path": "/path/to/D1"}},
        "bar2": 
            {"A2": {"name": "A2", "path": "/path/to/A2"}, 
             "B2": {"name": "B2", "path": "/path/to/B2"},
             "C2": {"name": "C2", "path": "/path/to/C2"},
             "D2": {"name": "D2", "path": "/path/to/D2"}}}
}

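I am trying to run my snakemake pipeline on the samples in sample sets 'bar1' and 'bar2' separately, putting the results into their own folders. When I expand my wildcards, I don't want every combination of sample set and sample; I only want each sample within its own group, like this: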

tmp/bar1/A1.bam
tmp/bar1/B1.bam
tmp/bar1/C1.bam
tmp/bar1/D1.bam
tmp/bar2/A2.bam
tmp/bar2/B2.bam
tmp/bar2/C2.bam
tmp/bar2/D2.bam

Hopefully my Snakefile helps explain. I have tried setting up the Snakefile like this:

sample_sets = [ i for i in config['foo'] ]

samples_dict = config['foo'] #cleans it up

def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = get_samples), sample_set = sample_sets),

This doesn't work; my file names end up with "" in them! I also tried:

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = list(samples_dict["{{sample_set}}"].keys()), sample_set = sample_sets),

But this raises a KeyError. I also tried this:

rule all:
    input:
        [ ["tmp/{{sample_set}}/{sample}.aligned_bam.core.bam".format( sample = sample ) for sample in list(samples_dict[sample_set].keys())] for sample_set in sample_sets ]

This fails with the error "Wildcards in input files cannot be determined from output files: 'sample_set'".

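I feel like there must be a simple way to do this, and maybe I'm just being an idiot.

Any help is greatly appreciated! Let me know if I've missed any details.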

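It is possible to use a custom combinatoric function in expand. Most often this function is zip; in your case, however, the nested dictionary shape would require designing a custom function. A simpler solution is to build the list of desired files with plain Python.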

d = {
    "foo": {
        "bar1": {
            "A1": {"name": "A1", "path": "/path/to/A1"},
            "B1": {"name": "B1", "path": "/path/to/B1"},
            "C1": {"name": "C1", "path": "/path/to/C1"},
            "D1": {"name": "D1", "path": "/path/to/D1"},
        },
        "bar2": {
            "A2": {"name": "A2", "path": "/path/to/A2"},
            "B2": {"name": "B2", "path": "/path/to/B2"},
            "C2": {"name": "C2", "path": "/path/to/C2"},
            "D2": {"name": "D2", "path": "/path/to/D2"},
        },
    }
}

list_files = []

for key in d["foo"]:
    for nested_key in d["foo"][key]:
        _tmp = f"tmp/{key}/{nested_key}.bam"
        list_files.append(_tmp)

print(*list_files, sep="\n")
#tmp/bar1/A1.bam
#tmp/bar1/B1.bam
#tmp/bar1/C1.bam
#tmp/bar1/D1.bam
#tmp/bar2/A2.bam
#tmp/bar2/B2.bam
#tmp/bar2/C2.bam
#tmp/bar2/D2.bam
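
For comparison, here is a small sketch of why a plain expand over both wildcards gives too many files: by default expand combines the wildcard values as a Cartesian product, which itertools.product imitates below without importing snakemake. The names all_samples, crossed and grouped are made up for illustration, and d is trimmed to two samples per set.

```python
from itertools import product

# Trimmed stand-in for the config: only the keys matter here.
d = {"foo": {"bar1": {"A1": {}, "B1": {}}, "bar2": {"A2": {}, "B2": {}}}}

sample_sets = list(d["foo"])
all_samples = [s for ss in sample_sets for s in d["foo"][ss]]

# Every sample set paired with every sample, like a default expand.
crossed = [f"tmp/{ss}/{s}.bam" for ss, s in product(sample_sets, all_samples)]

# Only the pairs that actually exist in the nested dictionary.
grouped = [f"tmp/{ss}/{s}.bam" for ss in sample_sets for s in d["foo"][ss]]

print(len(crossed), len(grouped))
# The unwanted extras, e.g. tmp/bar1/A2.bam
print(sorted(set(crossed) - set(grouped)))
```

The product yields twice as many paths as the nested loop, and the extras are exactly the cross-group files the question wants to avoid.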

@SultanOrazbayev has the right of it, but here are a couple of alternatives.

If you like the loops, the Pythonic way to write them is a list comprehension. With a huge list of files, you may notice a performance improvement.

list_files = [
    f"tmp/{key}/{nested_key}.bam"
    for key in d["foo"]
    for nested_key in d["foo"][key]
]
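
As an aside on why the similar-looking list attempt in the question failed: str.format treats doubled braces as an escape, so {{sample_set}} comes back out as a literal {sample_set}, and the resulting strings still contain a wildcard that rule all has no way to resolve. A minimal sketch, with variable names chosen only for illustration:

```python
# Doubled braces are an escape in str.format: they come out as single braces.
escaped = "tmp/{{sample_set}}/{sample}.bam".format(sample="A1")

# Filling in both fields leaves no wildcard behind.
filled = "tmp/{sample_set}/{sample}.bam".format(sample_set="bar1", sample="A1")

print(escaped)  # tmp/{sample_set}/A1.bam
print(filled)   # tmp/bar1/A1.bam
```

The f-string in the comprehension above fills both fields at once, which is why it produces concrete paths.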

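I think the only way to use expand is to build essentially the same list first. I pass it in as dictionaries, which also preserves the wildcard names, although tuples would be more efficient. expand is worthwhile if the filename pattern lives in a config variable and can't easily be formatted, if you want to keep meaningful wildcard names, or if you want to use allow_missing for other wildcards: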

wcs = [
    {'sample_set': sample_set, 'sample': sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
]

list_files = expand(
    "tmp/{sample_set}/{sample}.bam",
    zip,
    sample_set=[wc['sample_set'] for wc in wcs],
    sample=[wc['sample'] for wc in wcs],
)
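
As a rough picture of what the zip call above produces, the same list can be built in plain Python by formatting the pattern once per dictionary. This is only a sketch that does not import snakemake; pattern is a hypothetical stand-in for a filename template read from a config variable, sample_set_col and sample_col are made-up names, and d is trimmed to two samples per set.

```python
# Trimmed stand-in for the config; only the keys are used.
d = {"foo": {"bar1": {"A1": {}, "B1": {}}, "bar2": {"A2": {}, "B2": {}}}}

# Hypothetical template, e.g. one that came from a config variable.
pattern = "tmp/{sample_set}/{sample}.bam"

wcs = [
    {"sample_set": sample_set, "sample": sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
]

# One file per dictionary, so sample_set and sample stay paired up.
list_files = [pattern.format(**wc) for wc in wcs]

# The tuple variant: transpose the pairs into two parallel sequences,
# one per wildcard, as in the keyword arguments of the zip call.
pairs = [(wc["sample_set"], wc["sample"]) for wc in wcs]
sample_set_col, sample_col = zip(*pairs)

print(list_files)
```

Keeping the wildcard names as dictionary keys means the template can change without touching the code that fills it in.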

Sometimes the snakemake way isn't Pythonic!