Snakemake

Question

I am new to Snakemake and not very fluent in Python either (so apologies, this may be a very basic question):

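I am currently building a pipeline to analyze a set of BAM files with atlas. These BAM files live in different folders and should not be moved to a common folder. I therefore decided to provide a sample list that looks like this (this is just an example; in reality the samples may be on entirely different drives):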

Sample     Path
Sample1    /some/path/to/my/sample/
Sample2    /some/different/path/

and to reference it in my config.yaml:

sample_file: /path/to/samplelist/samplslist.txt

Now to my Snakefile:

import pandas as pd

#define configfile with paths etc.
configfile: "config.yaml"

#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)

#this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':

rule indexBam:
    input: 
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"

#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"   
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}' > {output}"

I get the error

WildcardError in line 28 of path/to/my/Snakefile: Wildcards in input files cannot be determined from output files: 'path'

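Can anyone help me?
- I tried to solve this with join or by writing input functions, but I think I am not skilled enough to see my mistake...
- I guess the problem is that my summary rule does not carry the tuple with {path} for the bamdiagnostics output (since that output lives somewhere else), so it cannot make the connection to the input file, or something like that...
- Expanding the input of my bamdiagnostics rule makes the code work, but of course it feeds every sample into every sample's output and makes a big mess: in this case, both BAM files are used to create each output file. This is wrong, as the samples AND the outputs are to be treated independently.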
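
To illustrate the last point, here is a minimal plain-Python sketch that only imitates a zipped expansion; it is not Snakemake's own code, and the helper names `zipped_expand` and `inputs_for_job` are hypothetical. It assumes the two example samples from the list above and shows that an expanded list is the same no matter which sample a job is for:

```python
# Plain-Python imitation of a zipped expansion; this is not Snakemake's own code.
PATHS = ["/some/path/to/my/sample/", "/some/different/path/"]
SAMPLES = ["Sample1", "Sample2"]


def zipped_expand(pattern, paths, samples):
    """Fill the pattern once per (path, sample) pair."""
    return [pattern.format(path=p, sample=s) for p, s in zip(paths, samples)]


def inputs_for_job(sample):
    """Inputs a job for `sample` would receive if its input were the expanded list."""
    # The requested sample is never consulted, so every job gets the same list.
    return zipped_expand("{path}{sample}.bam.bai", PATHS, SAMPLES)


print(inputs_for_job("Sample1"))
print(inputs_for_job("Sample1") == inputs_for_job("Sample2"))
```

Because the list is built before any job exists, each job ends up depending on the index files of all samples rather than only its own.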

Answer 1

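According to the atlas documentation, it looks like you need to run each rule once per sample; the complication here is that each sample lives in a separate path.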
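
The underlying idea is a per-sample lookup: given a sample name, find its directory in the sample list and build the file path from it. A minimal plain-Python sketch of that lookup, assuming a tab-separated sample list like the one in the question and BAM files named `<Sample>.bam`; the names `SAMPLE_SHEET`, `load_bam_dirs` and `bam_for` are hypothetical and exist only for this illustration:

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical sample sheet, tab-separated, mirroring the example in the question.
SAMPLE_SHEET = (
    "Sample\tPath\n"
    "Sample1\t/some/path/to/my/sample/\n"
    "Sample2\t/some/different/path/\n"
)


def load_bam_dirs(text):
    """Return a {sample: directory} mapping from a tab-separated sample sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Sample"]: row["Path"] for row in reader}


def bam_for(sample, bam_dirs):
    """Build the BAM path for one sample; an unknown sample raises KeyError."""
    return str(PurePosixPath(bam_dirs[sample]) / f"{sample}.bam")


bam_dirs = load_bam_dirs(SAMPLE_SHEET)
print(bam_for("Sample1", bam_dirs))
print(bam_for("Sample2", bam_dirs))
```

In a Snakefile the same lookup can be done inside an input function, which receives the job's wildcards and returns the path for that one sample.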

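I modified your script to work for the case above (see DAG). I renamed the variables at the top of the script so that they are more meaningful, removed config for demonstration purposes, and used the pathlib library (instead of os.path.join). pathlib is not required, but it helps me stay sane. I also modified the shell commands to avoid config.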

import pandas as pd
from pathlib import Path

df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
SAMPLES = df.index
BAM_PATH = df["Path"]
# print (BAM_PATH['sample1'])

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)


rule indexBam:
    input:
        str( Path("{path}") / "{sample}.bam")
    output:
        str( Path("{path}") / "{sample}.bam.bai")
    shell:
        "samtools index {input}"

# The output only carries {sample}, so the BAM directory is looked up per sample in an input function.
rule bamdiagnostics:
    input:
        bam = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
        bai = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        ".atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        index = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    output:
        str( Path("{path}") / "{sample}.summary.txt")
    shell:
        "echo -e '{input.index} {input.bamd}' > {output}"

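Why the original version raised the WildcardError can be sketched in plain Python. The snippet below only imitates the idea of deriving wildcard values by matching an output pattern against a requested file; it is not Snakemake's implementation, and `pattern_to_regex` and `infer_wildcards` are hypothetical helpers. It assumes each wildcard matches one or more arbitrary characters and that no wildcard name is repeated within a pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn an output pattern such as '{path}/{sample}.txt' into a regex with named groups."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even positions are literal text, odd positions are wildcard names.
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$")


def infer_wildcards(output_pattern, target):
    """Return the wildcard values implied by matching `target`, or None if no match."""
    match = pattern_to_regex(output_pattern).match(target)
    return match.groupdict() if match else None


diag = infer_wildcards(
    "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
    "analysis/BAMDiagnostics/Sample1_approximateDepth.txt",
)
summary = infer_wildcards(
    "{path}/{sample}.summary.txt",
    "/some/path/to/my/sample/Sample1.summary.txt",
)
print(diag)     # only 'sample' is known here
print(summary)  # both 'path' and 'sample' are known here
```

The bamdiagnostics outputs yield a value for `sample` only, so an input written as "{path}{sample}.bam" has nothing to fill `{path}` with. Looking the directory up from the sample name, as the lambda inputs above do, avoids needing `{path}` in that rule at all.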

Snakemake - Wildcards in input files cannot be determined from output files

python

pipeline

python-3.x