Interconnected variables in snakemake

Question

Suppose I have sample SAMPLE_A, split into two files SAMPLE_A_1, SAMPLE_A_2 and associated with barcodes AATT, TTAA, and sample SAMPLE_B, associated with barcodes CCGG, GGCC, GCGC and split into 4 files SAMPLE_B_1...SAMPLE_B_4.

I can create getSampleNames() to get [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [1,2,1,2,3,4], then zip them to get the {sample}_{id} combinations. I can then do the same for the barcodes: [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [AATT,TTAA,CCGG,GGCC,GCGC].
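The pairing described above can be sketched in plain Python, without Snakemake. The lists are hard-coded stand-ins for what getSampleNames() and getBCs() would return (their real implementations are not shown), and the f-strings only mimic what expand(..., zip, ...) produces.

```python
# Hard-coded stand-ins for the return values of getSampleNames() and getBCs()
SAMPLES_ID = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
IDs = [1, 2, 1, 2, 3, 4]
SAMPLES_BC = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
BCs = ["AATT", "TTAA", "CCGG", "GGCC", "GCGC"]

# zip pairs the lists element-wise, so only valid combinations appear
file_stems = [f"{sample}_{id_}" for sample, id_ in zip(SAMPLES_ID, IDs)]
cell_dirs = [f"{sample}/cells/{bc}_{sample}" for sample, bc in zip(SAMPLES_BC, BCs)]

print(file_stems)
print(cell_dirs)
```

Because the two pairs of lists have different lengths (6 files, 5 barcodes), they cannot share a single zip, which is why the ids and the barcodes are tracked separately.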

SAMPLES_ID, IDs = getSampleNames()
SAMPLES_BC, BCs = getBCs(set(SAMPLES_ID))

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
    shell:
        "isoseq3 refine {input} "


rule split:
    input:
        expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
    output:
        expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
    shell:
        "python {params.script_dir}/split_cells_bam.py"


rule dedup_split:
    input:
        "{sample}/cells/{barcode}_{sample}/fltnc.bam"
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    shell:
        "isoseq3 dedup {input} {output.bam} "

rule merge:
    input:
        expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
               zip, sample = SAMPLES_BC, barcode = BCs),

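How can I prevent rule split from becoming a bottleneck in my pipeline? Right now it waits for rule refine to finish for all samples, which is unnecessary: each sample should run independently, but I can't do that because the set of barcodes differs per sample. Is there a way to have something like

expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC])

where each {sample} in SAMPLES_BC is a key of a BCs dictionary? And the same for IDs? I know I can use functions, but I'm not sure how to propagate {barcode} through the rules.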

Answer 1

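Per your comment, there are a few routes you could take that involve changing the data structure that holds the samples, barcodes and ids. For now, you can just create one rule per sample: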

for sample in set(SAMPLES_ID):  # get uniq samples
    # get ids and barcodes for this sample
    ids = [tup[1] for tup in zip(SAMPLES_ID, IDs) if tup[0] == sample]
    bcs = [tup[1] for tup in zip(SAMPLES_BC, BCs) if tup[0] == sample]

    rule:
        name: f'{sample}_split'
        input:
            expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', 
                   sample = sample, id = ids),
        output:
            expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", 
                   sample = sample, barcode = bcs),
        shell:
            "python {params.script_dir}/split_cells_bam.py"

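You don't need zip in the expand calls, because ids and bcs belong to a single sample. I don't think this is the best approach in general, but it is the simplest one for your current workflow.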
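As a rough illustration of the data-structure change mentioned above, the parallel lists could be regrouped into dictionaries keyed by sample. This is only a sketch in plain Python: group_by_sample is a hypothetical helper name, and the lists are hard-coded stand-ins for the values from the question.

```python
from collections import defaultdict

# Stand-ins for the parallel lists used in the question
SAMPLES_ID = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
IDs = [1, 2, 1, 2, 3, 4]
SAMPLES_BC = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
BCs = ["AATT", "TTAA", "CCGG", "GGCC", "GCGC"]

def group_by_sample(samples, values):
    """Collect the values that share a sample name into one list per sample."""
    grouped = defaultdict(list)
    for sample, value in zip(samples, values):
        grouped[sample].append(value)
    return dict(grouped)

ids_by_sample = group_by_sample(SAMPLES_ID, IDs)
bcs_by_sample = group_by_sample(SAMPLES_BC, BCs)

print(ids_by_sample)
print(bcs_by_sample)
```

With dictionaries like these, the ids or barcodes of one sample can be looked up directly instead of filtering the zipped lists inside the loop.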

I just noticed your shell command: how are you passing input/output to your script?
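One possible way to pass the files explicitly is a small command-line interface in the script. The sketch below is hypothetical: the real interface of split_cells_bam.py is not shown in this thread, and the --inputs and --outputs option names are assumptions made for illustration.

```python
import argparse

def parse_args(argv=None):
    """Parse a hypothetical CLI; the option names are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI for a split script")
    parser.add_argument("--inputs", nargs="+", required=True,
                        help="input BAM files for one sample")
    parser.add_argument("--outputs", nargs="+", required=True,
                        help="per-barcode output BAM files")
    return parser.parse_args(argv)

# Example invocation with made-up file names
args = parse_args(["--inputs", "a_1.bam", "a_2.bam",
                   "--outputs", "AATT.bam", "TTAA.bam"])
print(args.inputs, args.outputs)
```

With such an interface, the shell line could read "python {params.script_dir}/split_cells_bam.py --inputs {input} --outputs {output}", since Snakemake expands {input} and {output} to space-separated file lists. Note that {params.script_dir} also needs a matching params: section in the rule.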

Answer 2

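I found a way to use dictionaries through functions, which solved my problem!

The main drawback of this solution is that you have to create a dummy file as the output of rule split, rather than checking that each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file has been created, so I'm still looking for something more elegant...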

IDs = getSampleNames()  # {SAMPLE_A: [1, 2], SAMPLE_B: [1, 2, 3, 4]}
SAMPLES = list(IDs.keys())
BCs = getBCs(SAMPLES)  # {SAMPLE_A: [AATT, TTAA], SAMPLE_B: [CCGG, GGCC, GCGC]}

# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
                  sample = wildcards.sample, id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
def sample2bc(wildcards):
    return expand('{sample}/cells/{barcode}_{sample}/dedup/dedup.bam',
                  sample = wildcards.sample, barcode = BCs[wildcards.sample])

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
    input:
        sample2ids
    output:
        # cannot use a function here, so I create a dummy file to pipe;
        # it is per sample so that {sample} is defined for sample2ids
        '{sample}/dummy_file.txt'

rule dedup_split:
    input:
        '{sample}/dummy_file.txt'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",


rule merge:
    input:
        # the output (not shown) must contain {sample} for sample2bc to work
        sample2bc
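To make the per-sample lookup concrete, here is a plain-Python sketch of what the two lookup functions resolve to for one sample. It does not use Snakemake: SimpleNamespace stands in for the wildcards object, list comprehensions stand in for expand, and the dictionaries are hard-coded with the values from the comments above.

```python
from types import SimpleNamespace

# Hard-coded stand-ins for the dictionaries returned by getSampleNames() and getBCs()
IDs = {"SAMPLE_A": [1, 2], "SAMPLE_B": [1, 2, 3, 4]}
BCs = {"SAMPLE_A": ["AATT", "TTAA"], "SAMPLE_B": ["CCGG", "GGCC", "GCGC"]}

def sample2ids(wildcards):
    """Return the refined BAM files that belong to the requested sample."""
    s = wildcards.sample
    return [f"{s}/polyA_trimming/{s}_{i}.fltnc.bam" for i in IDs[s]]

def sample2bc(wildcards):
    """Return the deduplicated per-barcode BAM files of the requested sample."""
    s = wildcards.sample
    return [f"{s}/cells/{bc}_{s}/dedup/dedup.bam" for bc in BCs[s]]

# Stand-in for the wildcards object Snakemake passes to an input function
wc = SimpleNamespace(sample="SAMPLE_A")
print(sample2ids(wc))
print(sample2bc(wc))
```

Each call only touches the files of the requested sample, which is what lets one sample proceed without waiting for the others.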
