Snakemake - Many to one using an expand exception
I have a functioning Snakefile for partitioned heritability that uses multiple bed files. It generates a perfect list of jobs with snakemake -np, so the file only needs one small tweak (I hope!).
My problem is with the merge_peaks rule below.
At this stage I have 25 bed files, and merge_peaks needs to run 5 times, one call per extension in ext=[100,200,300,400,500], so each call should only include the 5 bed files with the relevant extension.
For example, for the merge_peaks output file peak_files/Fullard2018_peaks.mrgd.blrm.100.bed, I only need the following 5 bed files with ext=100 as input:
peak_files/fullard2018_NpfcATAC_1.blrm.100.bed
peak_files/fullard2018_NpfcATAC_2.blrm.100.bed
peak_files/fullard2018_NpfcATAC_3.blrm.100.bed
peak_files/fullard2018_NpfcATAC_4.blrm.100.bed
peak_files/fullard2018_NpfcATAC_5.blrm.100.bed
Here is my config file:
samples:
  fullard2018_NpfcATAC_1:
  fullard2018_NpfcATAC_2:
  fullard2018_NpfcATAC_3:
  fullard2018_NpfcATAC_4:
  fullard2018_NpfcATAC_5:
index: /home/genomes_and_index_files/hg19.chrom.sizes
Here is the Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("peak_files/{sample}.blrm.{ext}.bed", sample=config['samples'], ext=[100,200,300,400,500]),
        expand("LD_annotation_files/Fullard2018.{ext}.{chr}.l2.ldscore.gz", sample=config['samples'], ext=[100,200,300,400,500], chr=range(1,23))

rule annot2bed:
    input:
        folder = "Reference/baseline"
    params:
        file = "Reference/baseline/baseline.{chr}.annot.gz"
    output:
        "LD_annotation_files/baseline.{chr}_no_head.bed"
    shell:
        "zcat {params.file} | tail -n +2 | awk -v OFS=\"\t\" '{{print \"chr\"$1, $2-1, $2, $3, $4}}' "
        "| sort -k1,1 -k2,2n > {output}"

rule extend_bed:
    input:
        "peak_files/{sample}_peaks.blrm.narrowPeak"
    output:
        "peak_files/{sample}.blrm.{ext}.bed"
    params:
        ext = "{ext}",
        index = config["index"]
    shell:
        "bedtools slop -i {input} -g {params.index} -b {params.ext} > {output}"

rule merge_peaks:
    input:
        expand("peak_files/{sample}.blrm.{ext}.bed", sample = config['samples'], ext=[100,200,300,400,500])
    output:
        "peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed"
    shell:
        "cat {input} | bedtools sort -i stdin | bedtools merge -i stdin > {output}"

rule intersect_mybed:
    input:
        annot = rules.annot2bed.output,
        mybed = rules.merge_peaks.output
    output:
        "LD_annotation_files/Fullard2018.{ext}.{chr}.annot.gz"
    params:
        uncompressed = "LD_annotation_files/Fullard2018.{ext}.{chr}.annot"
    shell:
        "echo -e \"CHR\tBP\tSNP\tCM\tANN\" > {params.uncompressed}; "
        "/share/apps/bedtools intersect -a {input.annot} -b {input.mybed} -c | "
        "sed 's/^chr//g' | awk -v OFS=\"\t\" '{{print $1, $3, $4, $5, $6}}' >> {params.uncompressed}; "
        "gzip {params.uncompressed}"

rule ldsr:
    input:
        annot = "LD_annotation_files/Fullard2018.{ext}.{chr}.annot.gz",
        bfile_folder = "Reference/1000G_plinkfiles",
        snps_folder = "Reference/hapmap3_snps"
    output:
        "LD_annotation_files/Fullard2018.{ext}.{chr}.l2.ldscore.gz"
    conda:
        "envs/p2-ldscore.yaml"
    params:
        bfile = "Reference/1000G_plinkfiles/1000G.mac5eur.{chr}",
        ldscores = "LD_annotation_files/Fullard2018.{ext}.{chr}",
        snps = "Reference/hapmap3_snps/hm.{chr}.snp"
    log:
        "logs/LDSC/Fullard2018.{ext}.{chr}_ldsc.txt"
    shell:
        "export MKL_NUM_THREADS=2;"  # Export arguments are workaround as ldsr uses all available cores
        "export NUMEXPR_NUM_THREADS=2;"  # Numbers must match cores parameter in cluster config
        "Reference/ldsc/ldsc.py --l2 --bfile {params.bfile} --ld-wind-cm 1 "
        "--annot {input.annot} --out {params.ldscores} --print-snps {params.snps} 2> {log}"
What currently happens is that all 25 bed files are fed into the merge_peaks rule on every call, whereas each call only needs the 5 files with the relevant extension. I'm struggling to work out how to use expand correctly, or an alternative approach, so that only the relevant bed files are included and merged in each call to the rule.
This question asks something similar, I think, but isn't what I'm after, as it doesn't use a config file.
I like Snakemake, but my Python is a bit of an adventure.
Many thanks.
I have a solution that uses double curly braces as escape characters. It works and gives me what I need, but there is still something I don't understand.
For the input line of the merge_peaks rule, instead of:
expand("peak_files/{sample}.blrm.{ext}.bed", sample = config['samples'], ext=[100,200,300,400,500])
I put {{}} around ext, like this:
expand("peak_files/{sample}.blrm.{{ext}}.bed", sample = config['samples'], ext=[100,200,300,400,500])
This escapes ext from the expand statement, so that each call to merge_peaks only gets input files with a single extension value. However, instead of 5 files being passed to the input statement (one per sample for the given extension, see question), I get five copies of each file passed to the input:
cat
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
| bedtools sort -i stdin | bedtools merge -i stdin > peak_files/Fullard2018_peaks.mrgd.blrm.500.bed
Ultimately this has no effect on the result, since it gives me the output I need to move forward, but it means I'm needlessly merging identical files. So there must be a better answer than this.
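The duplication above can be reproduced in plain Python by mimicking what expand does internally: it formats the pattern once for every combination in the Cartesian product of the keyword values, and the escaped {{ext}} passes through str.format as the literal {ext}, so every ext value yields the same string for a given sample. This is a simplified sketch of that mechanism, not snakemake's actual implementation:

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    # Format the pattern for every combination of keyword values.
    # "{{ext}}" survives str.format as the literal "{ext}", so each
    # sample produces one identical string per ext value -> duplicates.
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]

files = expand_sketch("peak_files/{sample}.blrm.{{ext}}.bed",
                      sample=["fullard2018_NpfcATAC_1", "fullard2018_NpfcATAC_2"],
                      ext=[100, 200, 300, 400, 500])
print(len(files))            # 10 entries: each sample repeated 5 times
print(files[0] == files[4])  # True: identical copies for the first sample
```

The product over sample and ext still runs even though ext no longer appears as a live placeholder, which is exactly why each path shows up len(ext) times.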
If I understand correctly, you've managed to create one file per sample per extension (25 files in total), and now you want to merge the files that share the same extension. So the final output you need should be:
peak_files/Fullard2018_peaks.mrgd.blrm.100.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.200.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.300.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.400.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.500.bed
(For testing, I created the 25 input files to be merged by extension):
mkdir -p peak_files
for i in 100 200 300 400 500
do
touch peak_files/fullard2018_NpfcATAC_1.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_2.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_3.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_4.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_5.blrm.${i}.bed
done
This snakefile should do the job. Of course, you can move samples and exts to config entries:
samples = ['fullard2018_NpfcATAC_1',
           'fullard2018_NpfcATAC_2',
           'fullard2018_NpfcATAC_3',
           'fullard2018_NpfcATAC_4',
           'fullard2018_NpfcATAC_5']

exts = [100, 200, 300, 400, 500]

rule all:
    input:
        expand('peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed', ext=exts),

rule merge_peaks:
    input:
        lambda wildcards: expand('peak_files/{sample}.blrm.{ext}.bed',
            sample=samples, ext=wildcards.ext),
    output:
        'peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed',
    shell:
        r"""
        cat {input} > {output}
        """
The lambda function in merge_peaks says: for each extension ext, give me a list of files, one for each sample in samples.
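The effect of that input function can be illustrated in plain Python: given an object carrying the ext wildcard, it returns one path per sample for that single extension. Here SimpleNamespace stands in for snakemake's Wildcards object, which snakemake fills from the requested output file:

```python
from types import SimpleNamespace

samples = ['fullard2018_NpfcATAC_1', 'fullard2018_NpfcATAC_2',
           'fullard2018_NpfcATAC_3', 'fullard2018_NpfcATAC_4',
           'fullard2018_NpfcATAC_5']

# Same shape as the rule's input function: ext is fixed, sample varies.
input_fn = lambda wildcards: [
    'peak_files/{sample}.blrm.{ext}.bed'.format(sample=s, ext=wildcards.ext)
    for s in samples]

# Stand-in for the wildcards of output peak_files/Fullard2018_peaks.mrgd.blrm.100.bed
wc = SimpleNamespace(ext=100)
for f in input_fn(wc):
    print(f)  # exactly 5 paths, all ending in .blrm.100.bed
```

So each of the 5 merge_peaks jobs sees only the 5 bed files for its own extension, with no duplicates.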