Snakemake input rule definition via lambda + Pandas dataframe
Apologies if this duplicates another question, but I have not been able to debug my case.
I have a dataframe like this:
Sample gender phenotype subject_id
0 ERR35175 male tumor 13
1 ERR35176 male control 13
2 ERR35177 female tumor 14
3 ERR35178 female control 14
4 ERR35179 male tumor 16
5 ERR35180 male control 16
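Given a subject_id, I need to take the tumor and control samples from the dataframe, join each with a path from the config and a file suffix, and produce one output named after the subject_id. To do that I wrote this rule (in the file snv_calling.smk):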
rule Mutect2:
    input:
        tumor=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "tumor") & (df.subject_id == [wc.patient])].Sample.values[0],
        normal=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "control") & (df.subject_id == [wc.patient])].Sample.values[0]
    output:
        vcf=config["datadirs"]["VCF"]+"/{patient}.vcf"
    shell:
        """
        gatk Mutect2 \
        -I {input.tumor} \
        -I {input.normal} \
        -O {output.vcf}
        """
In the Snakefile:
PATIENT=['13','14','16']
rule all:
    input:
        expand(config["datadirs"]["VCF"]+"/"+"{patient}.vcf", patient=PATIENT)
It gives me this error, where line 37 is the first input argument:
InputFunctionException in line 37 of ../rules/snv_calling.smk:
Error:
ValueError: ('Lengths must match to compare', (0,), (1,))
Wildcards:
patient=13
Traceback:
File "../rules/snv_calling.smk", line 45, in <lambda>
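I am struggling to understand what is going on, because the error shows that the wildcard patient was assigned correctly. If I run the function outside Snakemake over the PATIENT list, there is no error.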
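One likely cause, sketched here in plain pandas outside Snakemake: `df.subject_id == [wc.patient]` compares the column against a one-element list, which pandas treats as an element-wise comparison and rejects unless the lengths match. The wildcard is also a string, while the column (assumed here to be integer-typed, as in the table above) holds integers. The helper `sample_for` is a hypothetical name used only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Sample": ["ERR35175", "ERR35176", "ERR35177", "ERR35178", "ERR35179", "ERR35180"],
        "gender": ["male", "male", "female", "female", "male", "male"],
        "phenotype": ["tumor", "control"] * 3,
        "subject_id": [13, 13, 14, 14, 16, 16],
    }
)

patient = "13"  # Snakemake wildcard values are strings

# Comparing a Series with a one-element list is element-wise,
# so pandas raises unless the Series also has exactly one element.
try:
    df.subject_id == [patient]
    error_message = None
except ValueError as err:
    error_message = str(err)


def sample_for(patient, phenotype):
    """Hypothetical helper: look up one sample ID with a scalar comparison."""
    # Assumption: subject_id is an integer column, so the wildcard is cast to int.
    mask = (df.phenotype == phenotype) & (df.subject_id == int(patient))
    return df.loc[mask, "Sample"].values[0]


tumor = sample_for(patient, "tumor")
control = sample_for(patient, "control")
```

The same scalar comparison can be used inside the lambdas. The two shapes in the reported message, `(0,)` and `(1,)`, may also indicate that the dataframe was empty when the rule was evaluated, which is worth checking separately.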
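The parameters are stored in a dataframe, and there is a convenient utility for handling tabular parameters: Paramspace. Below is a rough sketch for your specific case; the command syntax and paths will need some adjustment.
The first step is to reshape the data to simplify the workflow: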
from io import StringIO

import pandas as pd

# Inline copy of the sample sheet, one row per sample
data = StringIO(
    """index Sample gender phenotype subject_id
0 ERR35175 male tumor 13
1 ERR35176 male control 13
2 ERR35177 female tumor 14
3 ERR35178 female control 14
4 ERR35179 male tumor 16
5 ERR35180 male control 16"""
)
# Raw string so the regex separator is not read as a string escape
df = pd.read_csv(data, sep=r"\s+")
# One row per subject, with the tumor and control sample IDs as columns
df = df.pivot(
    index=["subject_id", "gender"], values="Sample", columns="phenotype"
).reset_index()
# phenotype  subject_id  gender   control     tumor
# 0                  13    male  ERR35176  ERR35175
# 1                  14  female  ERR35178  ERR35177
# 2                  16    male  ERR35180  ERR35179
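Once the table has one row per subject, the lookup that the original lambdas attempted becomes a plain index access. The following is a minimal sketch under stated assumptions: `BAM_DIR` is a hypothetical stand-in for `config["datadirs"]["BQSR_2"]`, and `bam_path` is an illustrative helper name, not something from Snakemake.

```python
import pandas as pd

# Reshaped table from the step above, indexed by subject for direct lookup
wide = pd.DataFrame(
    {
        "subject_id": [13, 14, 16],
        "gender": ["male", "female", "male"],
        "control": ["ERR35176", "ERR35178", "ERR35180"],
        "tumor": ["ERR35175", "ERR35177", "ERR35179"],
    }
).set_index("subject_id")

BAM_DIR = "results/bqsr"  # hypothetical; stands in for config["datadirs"]["BQSR_2"]


def bam_path(patient, phenotype):
    """Build the BAM path for one subject and phenotype ('tumor' or 'control')."""
    # Wildcards arrive as strings; the index is assumed to be integer-typed.
    sample = wide.loc[int(patient), phenotype]
    return f"{BAM_DIR}/{sample}_recal.pass2.bam"


tumor_bam = bam_path("13", "tumor")
normal_bam = bam_path("13", "control")
```

In a rule, an input function such as `lambda wc: bam_path(wc.patient, "tumor")` could then replace the boolean-mask expression.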
Now, create the parameter space:
from snakemake.utils import Paramspace
paramspace = Paramspace(df, filename_params='*')
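To get a feel for the file names this produces, here is a rough imitation in plain pandas. It is not Paramspace's implementation; it assumes the default separators (`~` between a parameter name and its value, `_` between parameters) and that `filename_params='*'` puts every column into the file name, so check the result against your Snakemake version.

```python
import pandas as pd

# Reshaped table, one row per subject
wide = pd.DataFrame(
    {
        "subject_id": [13, 14, 16],
        "gender": ["male", "female", "male"],
        "control": ["ERR35176", "ERR35178", "ERR35180"],
        "tumor": ["ERR35175", "ERR35177", "ERR35179"],
    }
)

PARAM_SEP = "~"     # assumed default separator between name and value
FILENAME_SEP = "_"  # assumed default separator between parameters

# Pattern with one wildcard per column, e.g. "gender~{gender}"
wildcard_pattern = FILENAME_SEP.join(
    f"{col}{PARAM_SEP}{{{col}}}" for col in wide.columns
)

# One concrete file name per row of the table
instance_patterns = [
    wildcard_pattern.format(**row) for row in wide.to_dict("records")
]

for name in instance_patterns:
    print(name)
```

Under these assumptions, the first name is `subject_id~13_gender~male_control~ERR35176_tumor~ERR35175`, which is the kind of target that `rule all` requests and that the rule's output pattern matches.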
Finally, modify the rules to use the parameter space:
rule all:
    input:
        paramspace.instance_patterns
rule Mutect2:
    output:
        done=touch(paramspace.wildcard_pattern),
    params:
        parameters=paramspace.instance,
    shell:
        """
        echo {params.parameters[gender]}
        echo {params.parameters[tumor]}
        echo {params.parameters[control]}
        """
Update:
This can be adapted to use intermediate outputs. The parameter space behaves like a pandas dataframe, so you can select the columns of interest:
rule all:
    input:
        paramspace.instance_patterns,
rule some_rule:
    output:
        test=paramspace[["gender"]].wildcard_pattern,
rule Mutect2:
    input:
        test=paramspace[["gender"]].wildcard_pattern,
    output:
        done=touch(paramspace.wildcard_pattern),
    params:
        parameters=paramspace.instance,
    shell:
        """
        echo {params.parameters[gender]}
        echo {params.parameters[tumor]}
        echo {params.parameters[control]}
        """