How to use a wildcard within expand function parameters in snakemake?
I have a json file that looks like this:
{
  "foo": {
    "bar1": {
      "A1": {"name": "A1", "path": "/path/to/A1"},
      "B1": {"name": "B1", "path": "/path/to/B1"},
      "C1": {"name": "C1", "path": "/path/to/C1"},
      "D1": {"name": "D1", "path": "/path/to/D1"}
    },
    "bar2": {
      "A2": {"name": "A2", "path": "/path/to/A2"},
      "B2": {"name": "B2", "path": "/path/to/B2"},
      "C2": {"name": "C2", "path": "/path/to/C2"},
      "D2": {"name": "D2", "path": "/path/to/D2"}
    }
  }
}
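I am trying to run my snakemake pipeline on the samples in sample sets 'bar1' and 'bar2' separately, putting the results into their own folders. When I expand my wildcards I don't want every combination of sample set and sample; I only want each sample within its own group, like this: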
tmp/bar1/A1.bam
tmp/bar1/B1.bam
tmp/bar1/C1.bam
tmp/bar1/D1.bam
tmp/bar2/A2.bam
tmp/bar2/B2.bam
tmp/bar2/C2.bam
tmp/bar2/D2.bam
Hopefully my Snakefile helps explain. I have tried setting up the Snakefile like this:
sample_sets = [ i for i in config['foo'] ]
samples_dict = config['foo'] #cleans it up

def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = get_samples), sample_set = sample_sets),
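This does not work: my file names end up with the text of the get_samples function object in them (presumably something like "<function get_samples at 0x...>") instead of sample names!
I also tried: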
rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = list(samples_dict["{{sample_set}}"].keys())), sample_set = sample_sets),
But this gives a KeyError.
I also tried this:
rule all:
    input:
        [ ["tmp/{{sample_set}}/{sample}.aligned_bam.core.bam".format( sample = sample ) for sample in list(samples_dict[sample_set].keys())] for sample_set in sample_sets ]
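This gives a "Wildcards in input files cannot be determined from output files: 'sample_set'" error.
I feel like there must be a simple way of doing this and maybe I'm being an idiot.
Any help is much appreciated! Let me know if I have missed some details.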
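It is possible to use a custom combinatoric function in expand. Most of the time this function is zip; in your case, however, the nested dictionary shape would require designing a custom function. A simpler solution is to build the list of desired files with plain Python instead.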
d = {
    "foo": {
        "bar1": {
            "A1": {"name": "A1", "path": "/path/to/A1"},
            "B1": {"name": "B1", "path": "/path/to/B1"},
            "C1": {"name": "C1", "path": "/path/to/C1"},
            "D1": {"name": "D1", "path": "/path/to/D1"},
        },
        "bar2": {
            "A2": {"name": "A2", "path": "/path/to/A2"},
            "B2": {"name": "B2", "path": "/path/to/B2"},
            "C2": {"name": "C2", "path": "/path/to/C2"},
            "D2": {"name": "D2", "path": "/path/to/D2"},
        },
    }
}

list_files = []
for key in d["foo"]:
    for nested_key in d["foo"][key]:
        _tmp = f"tmp/{key}/{nested_key}.bam"
        list_files.append(_tmp)

print(*list_files, sep="\n")
#tmp/bar1/A1.bam
#tmp/bar1/B1.bam
#tmp/bar1/C1.bam
#tmp/bar1/D1.bam
#tmp/bar2/A2.bam
#tmp/bar2/B2.bam
#tmp/bar2/C2.bam
#tmp/bar2/D2.bam
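To connect this back to the Snakefile, the same loop can be wrapped in a small helper that reads the nested dict it is given instead of a hard-coded one. This is only a minimal sketch: it assumes the JSON is loaded as the workflow config, so that config["foo"] has the shape shown in the question, and the helper name target_files and its prefix/suffix parameters are made up for illustration.

```python
def target_files(foo, prefix="tmp", suffix=".bam"):
    """Return one path per (sample_set, sample) pair in the nested dict.

    `foo` is assumed to map sample_set -> {sample -> metadata}, i.e. the
    shape of config["foo"] in the question. The name `target_files` and
    the `prefix`/`suffix` parameters are hypothetical.
    """
    files = []
    for sample_set, samples in foo.items():
        for sample in samples:
            files.append(f"{prefix}/{sample_set}/{sample}{suffix}")
    return files


# Stand-in for config["foo"]; in a Snakefile this would come from the config.
example_foo = {
    "bar1": {"A1": {}, "B1": {}},
    "bar2": {"A2": {}, "B2": {}},
}

print(*target_files(example_foo), sep="\n")
```

In the Snakefile, rule all could then take target_files(config["foo"]) as its input, since the returned list holds concrete paths with no wildcards left in them.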
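@SultanOrazbayev has the right of it, but just to offer a couple of alternatives.
If you like those loops, the pythonic way to write them is a list comprehension. You may notice a performance improvement if you have a huge list of files.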
list_files = [
    f"tmp/{key}/{nested_key}.bam"
    for key in d["foo"]
    for nested_key in d["foo"][key]
]
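For what it's worth, the third attempt in the question was close to this comprehension. As far as I can tell it failed because "{{sample_set}}" is an escaped brace pair, so .format() leaves a literal {sample_set} in every path, and Snakemake then reads that as an unresolved wildcard in rule all. A small plain-Python sketch of the difference (the trimmed-down dict and the variable names broken and fixed are only for illustration):

```python
samples_dict = {"bar1": {"A1": {}, "B1": {}}, "bar2": {"A2": {}}}

# Escaped braces survive .format() as a literal "{sample_set}" placeholder.
broken = "tmp/{{sample_set}}/{sample}.bam".format(sample="A1")
print(broken)  # tmp/{sample_set}/A1.bam

# Filling both fields, in one flat comprehension, gives concrete paths.
fixed = [
    "tmp/{sample_set}/{sample}.bam".format(sample_set=sample_set, sample=sample)
    for sample_set in samples_dict
    for sample in samples_dict[sample_set]
]
print(fixed)
```

The flat comprehension also avoids the list-of-lists that the nested version in the question builds.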
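I think the only way to use expand is basically to build the same list. I pass it in as dicts, which also keeps the wildcard names, though tuples would be more efficient. expand has an advantage if the file name lives in a config variable and can't easily be formatted, if you want to keep meaningful wildcard names, or if you want to use allow_missing for other wildcards: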
wcs = [{'sample_set': sample_set, 'sample': sample}
       for sample_set in d["foo"]
       for sample in d["foo"][sample_set]
      ]
list_files = expand("tmp/{sample_set}/{sample}.bam", zip,
                    sample_set=[wc['sample_set'] for wc in wcs],
                    sample=[wc['sample'] for wc in wcs],
                    )
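The tuple variant mentioned above might look something like the following. This is only a sketch: it prepares the two parallel lists in plain Python and then imitates the zipped expand call with str.format, so it runs without Snakemake. The names pairs, sample_set_list and sample_list are illustrative, and the dict is a trimmed-down stand-in for the one in the question.

```python
d = {"foo": {"bar1": {"A1": {}, "B1": {}}, "bar2": {"A2": {}, "B2": {}}}}

# One (sample_set, sample) tuple per file instead of one dict per file.
pairs = [
    (sample_set, sample)
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
]

# Unzip into the two parallel lists that a zipped expand call would take.
# Note: this unpacking assumes `pairs` is not empty.
sample_set_list, sample_list = (list(col) for col in zip(*pairs))

# Plain-Python stand-in for the zipped expand call.
pattern = "tmp/{sample_set}/{sample}.bam"
list_files = [
    pattern.format(sample_set=s, sample=x)
    for s, x in zip(sample_set_list, sample_list)
]
print(list_files)
```

With Snakemake available, the last step would instead be expand(pattern, zip, sample_set=sample_set_list, sample=sample_list), as in the snippet above.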
Sometimes the snakemake way isn't pythonic!