Combine multiple wildcards in Snakemake
├── DIR1
│   ├── smp1.fastq.gz
│   ├── smp1_fastqc/
│   ├── smp2.fastq.gz
│   └── smp2_fastqc/
└── DIR2
    ├── smp3.fastq.gz
    ├── smp3_fastqc/
    ├── smp4.fastq.gz
    └── smp4_fastqc/
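I want to count the number of reads per sample, then concatenate all the counts per directory.
I created a dictionary that links samples 1 and 2 to directory 1, and samples 3 and 4 to directory 2.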
from itertools import product

DIRS, SAMPLES = glob_wildcards(INDIR+'/{dir}/{smp}.fastq.gz')

# Wrap a combinator (e.g. product) so it only yields allowed wildcard combinations
def filter_combinator(combinator, authlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            if frozenset(wc_comb) in authlist:
                yield wc_comb
    return filtered_combinator

# Build the set of allowed (dir, smp) combinations
combine_dir_samples = []
for dir in DIRS:
    samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    for smp in samples:
        combine_dir_samples.append( { "dir" : dir, "smp" : smp} )
combine_dir_samples = { frozenset( x.items() ) for x in combine_dir_samples }
dir_samples = filter_combinator(product, combine_dir_samples)
Then I created a rule to count the reads per sample:
rule all:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)

rule countReads:
    input:
        INDIR+'/{dir}/{smp}_fastqc/fastqc_data.txt'
    output:
        INDIR+'/{dir}/{smp}_Nreads.txt'
    shell:
        # Prints the directory name followed by the read count ($3 of the "Total Sequences" line)
        "grep 'Total Sequences' {input} | awk '{{print \"{wildcards.dir}\", $3}}' > {output}"
---------------------------------------------------------------
# result ok
├── DIR1
│   ├── smp1_Nreads.txt
│   └── smp2_Nreads.txt
└── DIR2
    ├── smp3_Nreads.txt
    └── smp4_Nreads.txt
> cat smp1_Nreads.txt
DIR1 15082186
However, I want to add a rule that concatenates my smp_Nreads.txt files per directory:
rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
    output:
        INDIR+'/{dir}/Nreads_{dir}.txt'
    shell:
        "cat {input} > {output}"
------------------------------------------------------------------
# result
├── DIR1
│   └── Nreads_DIR1.txt
└── DIR2
    └── Nreads_DIR2.txt
# but both files are identical
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
DIR2 11635831
DIR2 45924459
# I would like to have
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
> cat Nreads_DIR2.txt
DIR2 11635831
DIR2 45924459
I tried different input syntaxes for my concat rule:
expand(OUTFastq+'/{dir}/FastQC/{{smp}}_Nreads.txt', dir_samples, dir=DIRS)
lambda wildcards: expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
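I did not find any solution; it is as if this rule ignores my dictionary.
EDIT
I tried using a dictionary instead of my filter_combinator combination, with a function as the input of my rules to get the samples.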
dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}

def func(dir):
    return dir_to_samples[dir]

rule all:
    input:
        lambda wildcards: expand(OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip', dir=wildcards.dir, smp=func(wildcards.dir))

rule fastQC:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}.fastq.gz', dir=wildcards.dir, smp=func(wildcards.dir))
    output:
        OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip'
    shell:
        "fastqc {input} -o {OUTDIR}/{wildcards.dir}/FastQC/"
> AttributeError: 'Wildcards' object has no attribute 'dir'
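First of all, I think you are overcomplicating the solution, which makes it less idiomatic for Snakemake. As a result, you are running into problems implementing the rules. Anyway, let me answer the question in the form you asked it.
It is no surprise that your two Nreads_DIRx.txt files are identical, because the input does not depend on any wildcard in the output: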
rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
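Here the expand function resolves both the dir and smp variables, producing a list of fully specified filenames. What you need is something that really depends on the wildcard in the output: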
rule concatNreads:
    input:
        lambda wildcards: ...
{dir} is fully specified by the wildcard in the output, so you do not need to assign it values from the DIRS variable:
rule concatNreads:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir=wildcards.dir, smp=func(wildcards.dir))
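To see why the lambda makes a difference, the following plain-Python sketch imitates the two input definitions without Snakemake. The helper names (all_count_files, count_files_for) and the INDIR value are hypothetical, and the expansion is written by hand rather than with Snakemake's expand, so treat it as an illustration of the idea and not of Snakemake internals.

```python
# Hypothetical stand-ins for the values used in the Snakefile.
INDIR = "data"
dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}


def all_count_files():
    """Mimic the original input: every valid (dir, smp) pair, whatever the output is."""
    return [
        f"{INDIR}/{d}/{s}_Nreads.txt"
        for d, samples in dir_to_samples.items()
        for s in samples
    ]


def count_files_for(wildcards_dir):
    """Mimic the lambda: only the samples of the directory named by the output wildcard."""
    return [
        f"{INDIR}/{wildcards_dir}/{s}_Nreads.txt"
        for s in dir_to_samples[wildcards_dir]
    ]


# This list is the same whether Nreads_DIR1.txt or Nreads_DIR2.txt is requested,
# which is why both concatenated files end up identical.
print(all_count_files())

# These lists change with the requested directory.
print(count_files_for("DIR1"))
print(count_files_for("DIR2"))
```

The first function never looks at the requested output, while the second takes the directory as an argument, which is the role wildcards.dir plays in the rule.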
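Now the question is how to implement this func function so that it produces the list of samples for a directory. It took me a while to understand your trick with combine_dir_samples and filter_combinator, so I will leave it to you to implement func using that code. But what you really need is a map from DIR -> SAMPLES: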
dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}

def func(dir):
    return dir_to_samples[dir]
This dir_to_samples map could probably be computed more simply, but here is your solution, modified:
# dir_to_samples is a dict, so assign by key instead of calling append
dir_to_samples = {}
for dir in set(DIRS):
    samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    dir_to_samples[dir] = samples
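Since glob_wildcards returns the dir and smp values as parallel lists (one entry per matched file), the map can also be built in a single pass without globbing each directory again. The sketch below uses hard-coded lists in place of the glob_wildcards result; the helper name build_dir_to_samples and the INDIR value are hypothetical.

```python
from collections import defaultdict


def build_dir_to_samples(dirs, samples):
    """Group samples by directory from two parallel lists."""
    mapping = defaultdict(list)
    for d, s in zip(dirs, samples):
        mapping[d].append(s)
    return dict(mapping)


# Stand-in for: DIRS, SAMPLES = glob_wildcards(INDIR + '/{dir}/{smp}.fastq.gz')
DIRS = ["DIR1", "DIR1", "DIR2", "DIR2"]
SAMPLES = ["smp1", "smp2", "smp3", "smp4"]

dir_to_samples = build_dir_to_samples(DIRS, SAMPLES)
print(dir_to_samples)

# Concrete targets for a wildcard-free rule such as `rule all`:
# one concatenated file per directory.
INDIR = "data"  # hypothetical input directory
targets = [f"{INDIR}/{d}/Nreads_{d}.txt" for d in dir_to_samples]
print(targets)
```

A rule all has no output and therefore no wildcards, which is what the AttributeError in the edit points at. It needs a concrete list such as targets rather than a lambda that reads wildcards.dir.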