Snakemake input rule definition via lambda + Pandas dataframe

Sorry if this is a duplicate of another question, but I can't debug my case. I have a dataframe like this:

       Sample  gender phenotype subject_id
0  ERR35175    male     tumor         13
1  ERR35176    male   control         13
2  ERR35177  female     tumor         14
3  ERR35178  female   control         14
4  ERR35179    male     tumor         16
5  ERR35180    male   control         16

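Given a subject_id, I need to take the tumor and control samples from the dataframe as input, joined with a path from the config and a file suffix, to produce a single output named after the subject_id. For this I wrote the following rule (in the file snv_calling.smk):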

rule Mutect2:
    input:
        tumor=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "tumor") & (df.subject_id == [wc.patient])].Sample.values[0],
        normal=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "control") & (df.subject_id == [wc.patient])].Sample.values[0]
    output:
        vcf=config["datadirs"]["VCF"]+"/{patient}.vcf"
    shell:
    """
    gatk Mutect2 \
    -I {input.tumor} \
    -I {input.normal} \
    -O {output.vcf}
    """

Inside the Snakefile:

PATIENT=['13','14','16']
rule all:
    input:
        expand(config["datadirs"]["VCF"]+"/"+"{patient}.vcf", patient=PATIENT)

It gives me this error, where line 37 is the first input parameter:

InputFunctionException in line 37 of ../rules/snv_calling.smk:
Error:
  ValueError: ('Lengths must match to compare', (0,), (1,))
Wildcards:
  patient=13
Traceback:
  File "../rules/snv_calling.smk", line 45, in <lambda>

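I'm having a hard time understanding what is happening, because the error suggests the wildcard patient is assigned correctly. If I run the function outside Snakemake with the PATIENT list, there is no error.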
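
On the error itself: `Lengths must match to compare` is what pandas raises when a Series is compared element-wise against a list of a different length, which is what `df.subject_id == [wc.patient]` asks for. The following is a minimal sketch in plain pandas, assuming the table from the question; `patient` is a stand-in for `wc.patient`, and the `int(...)` cast assumes subject_id is stored as an integer column while wildcard values arrive as strings.

```python
import pandas as pd

# Sample table from the question (gender omitted, it is not needed here).
df = pd.DataFrame(
    {
        "Sample": ["ERR35175", "ERR35176", "ERR35177", "ERR35178", "ERR35179", "ERR35180"],
        "phenotype": ["tumor", "control", "tumor", "control", "tumor", "control"],
        "subject_id": [13, 13, 14, 14, 16, 16],
    }
)

patient = "13"  # stand-in for wc.patient; Snakemake wildcard values are strings

# Comparing a 6-row column with a 1-element list is a length mismatch.
try:
    df.subject_id == [patient]
    list_comparison_failed = False
except ValueError:
    list_comparison_failed = True

# Comparing with a scalar of the column's dtype broadcasts over all rows.
mask = (df.phenotype == "tumor") & (df.subject_id == int(patient))
tumor_sample = df[mask].Sample.values[0]

print(list_comparison_failed, tumor_sample)
```

The `(0,)` in the reported message may additionally mean that the column was empty when the rule was evaluated, but that cannot be confirmed from the post.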

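The parameters are stored in a dataframe, and there is a convenient utility for working with tabular parameters: Paramspace. Below is a rough sketch for your specific case; the command syntax and paths will need some adjustment.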

The first step is to reshape the data to simplify the workflow:

from io import StringIO

import pandas as pd

data = StringIO(
    """index Sample  gender phenotype subject_id
0  ERR35175    male     tumor         13
1  ERR35176    male   control         13
2  ERR35177  female     tumor         14
3  ERR35178  female   control         14
4  ERR35179    male     tumor         16
5  ERR35180    male   control         16"""
)

# Raw string so that "\s" is passed to the regex engine unchanged.
df = pd.read_csv(data, sep=r"\s+")

df = df.pivot(
    index=["subject_id", "gender"], values="Sample", columns="phenotype"
).reset_index()

#phenotype  subject_id  gender   control     tumor
#0                  13    male  ERR35176  ERR35175
#1                  14  female  ERR35178  ERR35177
#2                  16    male  ERR35180  ERR35179
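
With one row per patient, finding a patient's two samples becomes a single row selection. The following is a minimal sketch in plain pandas, not part of the original answer: `samples_for` is a hypothetical helper, the table is rebuilt by hand so the block runs on its own, and the patient value is assumed to arrive as a string, as Snakemake wildcards do.

```python
import pandas as pd

# Pivoted table from the step above, rebuilt here so the sketch is self-contained.
pivoted = pd.DataFrame(
    {
        "subject_id": [13, 14, 16],
        "gender": ["male", "female", "male"],
        "control": ["ERR35176", "ERR35178", "ERR35180"],
        "tumor": ["ERR35175", "ERR35177", "ERR35179"],
    }
)


def samples_for(patient):
    """Return (tumor, control) sample names for one patient (hypothetical helper)."""
    # subject_id is an integer column, so cast the string value before comparing.
    row = pivoted.loc[pivoted["subject_id"] == int(patient)].iloc[0]
    return row["tumor"], row["control"]


print(samples_for("13"))
```

A function of this shape could be called from an input lambda with `wc.patient`, although the Paramspace approach below makes that unnecessary.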

Now, create the parameter space:

from snakemake.utils import Paramspace
paramspace = Paramspace(df, filename_params='*')

Finally, modify the rules to use the parameter space:

rule all:
    input:
        paramspace.instance_patterns

rule Mutect2:
    output:
        done=touch(paramspace.wildcard_pattern),
    params:
        parameters=paramspace.instance,
    shell:
        """
        echo {params.parameters[gender]}
        echo {params.parameters[tumor]}
        echo {params.parameters[control]}
        """

Update:

This can be adapted to use intermediate outputs. The parameter space behaves like a pandas dataframe, so you can select the columns of interest:

rule all:
    input:
        paramspace.instance_patterns,


rule some_rule:
    output:
        test=paramspace[["gender"]].wildcard_pattern,


rule Mutect2:
    input:
        test=paramspace[["gender"]].wildcard_pattern,
    output:
        done=touch(paramspace.wildcard_pattern),
    params:
        parameters=paramspace.instance,
    shell:
        """
        echo {params.parameters[gender]}
        echo {params.parameters[tumor]}
        echo {params.parameters[control]}
        """