Snakemake: How to use config file efficiently
I am using the following config file format in snakemake for some sequencing analysis practice (I have a large number of samples, each with 2 fastq files):
samples:
  Sample1_XY:
    - fastq_files/SRR4356728_1.fastq.gz
    - fastq_files/SRR4356728_2.fastq.gz
  Sample2_AB:
    - fastq_files/SRR6257171_1.fastq.gz
    - fastq_files/SRR6257171_2.fastq.gz
I am using the following rules at the start of my pipeline to run fastqc and to align the fastq files:
import os

# read config info into this namespace
configfile: "config.yaml"

rule all:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"

rule fastqc:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_fastqc.html",
        zip="FastQC/{sample}_fastqc.zip"
    params: ""
    wrapper:
        "0.21.0/bio/fastqc"

rule bowtie2:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build)
        extra=""
    threads: 8
    wrapper:
        "0.21.0/bio/bowtie2/align"

rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
    output:
        "FastQC/fastq_multiqc.html"
    params: ""
    log:
        "logs/multiqc.log"
    wrapper:
        "0.21.0/bio/multiqc"
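My problem is with the fastqc rule.

Currently, both the fastqc rule and the bowtie2 rule create a single output file from the two inputs SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.

I need the fastqc rule to generate two outputs, one for each fastq.gz file, but I am unsure how to index the config file correctly from the fastqc rule's input statement, or how to combine expand and wildcards to solve this. I can get a single fastq file by adding [0] or [1] to the end of the input statement, but I cannot get both files run individually/separately.

I have been experimenting to find the right indexing format to access each file separately. The current format is the only one I have managed that allows snakemake -np to generate a job list.

Any tips would be greatly appreciated.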
It appears that each sample has two fastq files, named in the format ***_1.fastq.gz and ***_2.fastq.gz. In that case, the config and code below should work.
config.yaml:
samples:
  Sample_A: fastq_files/SRR4356728
  Sample_B: fastq_files/SRR6257171
Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"

rule fastqc:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_{num}_fastqc.html",
        zip="FastQC/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.21.0/bio/fastqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "bam_files/{sample}.bam"
    wrapper:
        "0.21.0/bio/bowtie2/align"

rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.21.0/bio/multiqc"
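
If you would rather keep the list-style config from the question (two explicit paths per sample), the same {num} wildcard idea should carry over: the input function can index the list with the read number. Below is a minimal plain-Python sketch of that lookup, runnable without Snakemake. The names fastqc_input, bowtie2_input and wc are hypothetical, the config dict is a stand-in for the parsed config.yaml, and SimpleNamespace only mimics the wildcards object Snakemake would pass in.

```python
from types import SimpleNamespace

# Stand-in for the parsed config.yaml from the question (list-of-files layout).
config = {
    "samples": {
        "Sample1_XY": [
            "fastq_files/SRR4356728_1.fastq.gz",
            "fastq_files/SRR4356728_2.fastq.gz",
        ],
        "Sample2_AB": [
            "fastq_files/SRR6257171_1.fastq.gz",
            "fastq_files/SRR6257171_2.fastq.gz",
        ],
    }
}


def fastqc_input(wildcards):
    """Return the single fastq file for one (sample, num) pair; num is "1" or "2"."""
    # Wildcard values arrive as strings, so convert before indexing the list.
    return config["samples"][wildcards.sample][int(wildcards.num) - 1]


def bowtie2_input(wildcards):
    """Return both fastq files of a sample, as the question's bowtie2 rule does."""
    return config["samples"][wildcards.sample]


# Hypothetical wildcard values, mimicking what Snakemake would pass in.
wc = SimpleNamespace(sample="Sample1_XY", num="2")
print(fastqc_input(wc))
print(bowtie2_input(wc))
```

In a Snakefile, the fastqc rule would then use something like input: lambda wildcards: config['samples'][wildcards.sample][int(wildcards.num) - 1], with the same {sample}_{num} output names as above. Since the sample names themselves contain underscores, it may also be worth restricting num with a wildcard_constraints entry such as num="[12]" so the pattern cannot be split in an unexpected place; I have not tested that against this exact pipeline.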