Snakemake:通过一个通配符组合输入
Snakemake: combine inputs over one wildcard
我想知道您是否可以就定义一个 Snakemake 规则来组合一个但不是所有的通配符提出建议?我的数据经过组织,因此我有 运行 和样本;大多数(但不是所有)样本在每个 运行 中都被重新测序。因此,我有每个样本的预处理步骤 - 运行。然后,我有一个步骤为每个 运行 每个样本组合 BAM 文件。但是,我 运行 遇到的问题是我有点困惑如何定义规则,以便我可以列出对应于一个样本。
为了清楚起见,我将整个管道放在下面,但我真正的问题是规则 combine_bams。如何列出输入中单个样本的所有 bams?
任何建议都会很棒!非常感谢您!
# Define samples and runs
RUNS, SAMPLES = glob_wildcards("/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz")
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)
rule all:
input:
#trim = ['process/trim/{run}_{samp}_trim_1.fq.gz'.format(samp=sample_id, run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)],
trim = expand(['process/trim/{run}_{samp}_trim_1.fq.gz'], zip, run = RUNS, samp = SAMPLES),
kraken=expand('process/trim/{run}_{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
bams=expand('process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam', zip, run = RUNS, samp = SAMPLES), # add fixed ref/mapper (expand with zip doesn't allow these to repeate)
combined_bams=expand('process/bams/{samp}_bwa_MTB_ancestor_reference.merged.rmdup.bam', samp = np.unique(SAMPLES))
# Trim reads for quality.
rule trim_reads:
input:
p1='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz', # update inputs so they only include those that exist use zip.
p2='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
output:
trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
log:
'process/trim/{run}_{samp}_trim_reads.log'
shell:
'/labs/jandr/walter/tb/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'
# Filter reads taxonomically with Kraken.
rule taxonomic_filter:
input:
trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
output:
kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz',
kraken_stats='process/trim/{run}_{samp}_kraken.report'
log:
'process/trim/{run}_{samp}_run_kraken.log'
threads: 8
shell:
'/labs/jandr/walter/tb/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_stats} &>> {log}'
# Map reads.
rule map_reads:
input:
ref_path='/labs/jandr/walter/tb/data/refs/{ref}.fasta.gz',
kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
output:
bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
params:
mapper='{mapper}'
log:
'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
threads: 8
shell:
"/labs/jandr/walter/tb/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
# Combine reads and remove duplicates (per sample).
rule combine_bams:
input:
bams = 'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam'
output:
combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
创建字典以将每个样本与其运行列表相关联。
然后对于 combine_bams 规则,使用输入函数使用字典为该样本生成输入文件。
rule combine_bams:
input:
bams = lambda wildcards: expand('process/bams/{run}_{{samp}}_bwa_MTB_ancestor_reference_rg_sorted.bam', run=sample_dict[wildcards.sample])
output:
combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
我想知道您是否可以就定义一个 Snakemake 规则来组合一个但不是所有的通配符提出建议?我的数据经过组织,因此我有 运行 和样本;大多数(但不是所有)样本在每个 运行 中都被重新测序。因此,我有每个样本的预处理步骤 - 运行。然后,我有一个步骤为每个 运行 每个样本组合 BAM 文件。但是,我 运行 遇到的问题是我有点困惑如何定义规则,以便我可以列出对应于一个样本。
为了清楚起见,我将整个管道放在下面,但我真正的问题是规则 combine_bams。如何列出输入中单个样本的所有 bams?
任何建议都会很棒!非常感谢您!
# Define samples and runs
RUNS, SAMPLES = glob_wildcards("/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz")
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)
rule all:
input:
#trim = ['process/trim/{run}_{samp}_trim_1.fq.gz'.format(samp=sample_id, run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)],
trim = expand(['process/trim/{run}_{samp}_trim_1.fq.gz'], zip, run = RUNS, samp = SAMPLES),
kraken=expand('process/trim/{run}_{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
bams=expand('process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam', zip, run = RUNS, samp = SAMPLES), # add fixed ref/mapper (expand with zip doesn't allow these to repeate)
combined_bams=expand('process/bams/{samp}_bwa_MTB_ancestor_reference.merged.rmdup.bam', samp = np.unique(SAMPLES))
# Trim reads for quality.
rule trim_reads:
input:
p1='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz', # update inputs so they only include those that exist use zip.
p2='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
output:
trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
log:
'process/trim/{run}_{samp}_trim_reads.log'
shell:
'/labs/jandr/walter/tb/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'
# Filter reads taxonomically with Kraken.
rule taxonomic_filter:
input:
trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
output:
kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz',
kraken_stats='process/trim/{run}_{samp}_kraken.report'
log:
'process/trim/{run}_{samp}_run_kraken.log'
threads: 8
shell:
'/labs/jandr/walter/tb/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_stats} &>> {log}'
# Map reads.
rule map_reads:
input:
ref_path='/labs/jandr/walter/tb/data/refs/{ref}.fasta.gz',
kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
output:
bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
params:
mapper='{mapper}'
log:
'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
threads: 8
shell:
"/labs/jandr/walter/tb/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
# Combine reads and remove duplicates (per sample).
rule combine_bams:
input:
bams = 'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam'
output:
combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
创建字典以将每个样本与其运行列表相关联。
然后对于 combine_bams 规则,使用输入函数使用字典为该样本生成输入文件。
rule combine_bams:
input:
bams = lambda wildcards: expand('process/bams/{run}_{{samp}}_bwa_MTB_ancestor_reference_rg_sorted.bam', run=sample_dict[wildcards.sample])
output:
combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"