Combining inputs in a Snakemake shell command
I have the following bash script, which I would like to convert to a Snakefile:
mmseqs rbh flye_db megahit_db flye_megahit_rbh --min-seq-id 0.9 mmseq2_tmp --threads 12
mmseqs rbh flye_db metaspades_db flye_metaspades_rbh --min-seq-id 0.9 mmseq2_tmp --threads 12
mmseqs rbh megahit_db metaspades_db megahit_metaspades_rbh --min-seq-id 0.9 mmseq2_tmp --threads 12
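I have managed to come up with the following, but I am wondering whether there is a way to use regular expressions or expand to improve the code further: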
rule mmseq2_compare:
    input:
        mm1=expand(os.path.join(RESULTS_DIR, "annotation/mmseq2/{assembler}_db"), assembler="flye"),
        mm2=expand(os.path.join(RESULTS_DIR, "annotation/mmseq2/{assembler}_db"), assembler="megahit"),
        mm3=expand(os.path.join(RESULTS_DIR, "annotation/mmseq2/{assembler}_db"), assembler="metaspades_hybrid")
    output:
        mo1=os.path.join(RESULTS_DIR, "annotation/mmseq2/flye_megahit_rbh"),
        mo2=os.path.join(RESULTS_DIR, "annotation/mmseq2/flye_metaspades_hybrid_rbh"),
        mo3=os.path.join(RESULTS_DIR, "annotation/mmseq2/megahit_metaspades_hybrid_rbh")
    log: os.path.join(RESULTS_DIR, "annotation/mmseq2/compare.mmseq2.log")
    conda: "cd-hit.yml"
    shell:
        """
        (date &&\
        mmseqs rbh {input.mm1} {input.mm2} {output.mo1} --min-seq-id 0.9 mmseq2_tmp --threads 12 &&\
        mmseqs rbh {input.mm1} {input.mm3} {output.mo2} --min-seq-id 0.9 mmseq2_tmp --threads 12 &&\
        mmseqs rbh {input.mm2} {input.mm3} {output.mo3} --min-seq-id 0.9 mmseq2_tmp --threads 12 &&\
        date) &> >(tee {log})
        """
With 3 assemblers (flye, megahit and metaspades_hybrid), is there any way to eliminate the redundancy, especially in 'shell'?
Thanks!
Dry-run output:
Building DAG of jobs...
Job counts:
    count    jobs
    1    all
    1    mmseq_compare
    2

[Thu Mar 26 12:25:14 2020]
rule mmseq_compare:
    input: results/annotation/mmseq2/flye_db, results/annotation/mmseq2/megahit_db
    output: results/annotation/mmseq2/flye_megahit_rbh
    jobid: 1
    wildcards: assembler1=flye, assembler2=megahit

mmseqs rbh results/annotation/mmseq2/flye_db results/annotation/mmseq2/megahit_db results/annotation/mmseq2/flye_megahit_rbh --min-seq-id 0.9 mmseq2_tmp --threads 12

[Thu Mar 26 12:25:14 2020]
localrule all:
    input: results/annotation/mmseq2/flye_megahit_rbh, results/annotation/mmseq2/flye_metaspades_hybrid_rbh, results/annotation/mmseq2/megahit_metaspades_hybrid_rbh
    jobid: 0

Job counts:
    count    jobs
    1    all
    1    mmseq_compare
    2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
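First of all, you don't need expand in your input. expand is meant for building a list of filenames that share one pattern; called with a single value, it only wraps one path in a list.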
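As a rough illustration of that point, the sketch below mimics in plain Python what expand produces for a pattern with one wildcard. It is a stand-in built on str.format, not Snakemake's own implementation, and the literal "results" prefix is an assumption taken from the dry-run paths.

```python
# Plain-Python stand-in for expand() with a single wildcard (illustration only).
pattern = "results/annotation/mmseq2/{assembler}_db"  # "results" is an assumed prefix
assemblers = ["flye", "megahit", "metaspades_hybrid"]

# Several values: one path per assembler, which is the case expand is meant for.
db_paths = [pattern.format(assembler=a) for a in assemblers]

# A single value: a one-element list, no better than writing the path directly.
single = [pattern.format(assembler="flye")]

print(db_paths)
print(single)
```

With only one value per call, as in the question, the literal path says the same thing with less noise.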
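Next, since you already use Unix-style slashes in your paths, you can put RESULTS_DIR into an f-string to improve readability (but don't forget to double the braces around wildcards, so that they survive f-string formatting).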
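The brace doubling can be checked in plain Python. In this small sketch the value of RESULTS_DIR is an assumption (taken from the dry-run paths), and the final str.format call only imitates Snakemake filling in a wildcard later.

```python
RESULTS_DIR = "results"  # assumed value, matching the paths in the dry-run output

# Single braces are substituted by the f-string right away;
# doubled braces come out as literal single braces.
pattern = f"{RESULTS_DIR}/annotation/mmseq2/{{assembler1}}_db"
print(pattern)  # results/annotation/mmseq2/{assembler1}_db

# The remaining placeholder is what Snakemake treats as a wildcard;
# str.format imitates that later substitution here.
filled = pattern.format(assembler1="flye")
print(filled)  # results/annotation/mmseq2/flye_db
```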
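Finally, there is no need to chain the commands with &&. Running each comparison as its own job and letting Snakemake schedule them is what Snakemake is designed for.
My revised version of the script: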
rule all:
    input:
        # zip pairs the two lists element by element instead of taking their product
        expand(f"{RESULTS_DIR}/annotation/mmseq2/{{assembler1}}__{{assembler2}}_rbh", zip,
               assembler1=["flye", "flye", "megahit"],
               assembler2=["megahit", "metaspades_hybrid", "metaspades_hybrid"])

rule mmseq_compare:
    input:
        f"{RESULTS_DIR}/annotation/mmseq2/{{assembler1}}_db",
        f"{RESULTS_DIR}/annotation/mmseq2/{{assembler2}}_db"
    output:
        f"{RESULTS_DIR}/annotation/mmseq2/{{assembler1}}__{{assembler2}}_rbh"
    conda:
        "cd-hit.yml"
    shell:
        "mmseqs rbh {input[0]} {input[1]} {output} --min-seq-id 0.9 mmseq2_tmp --threads 12"
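I left out date and the logging. One limitation of my solution is that the order in which the comparisons run is undefined, so you would need to rethink the logging strategy.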
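If the hard-coded assembler lists in rule all still feel redundant, the pairs could also be generated. The following is a minimal sketch in plain Python (which is valid at the top of a Snakefile), assuming RESULTS_DIR is "results" as in the dry-run paths; the names ASSEMBLERS, PAIRS and RBH_TARGETS are hypothetical and not part of the original answer.

```python
from itertools import combinations

RESULTS_DIR = "results"  # assumed value, matching the dry-run paths
ASSEMBLERS = ["flye", "megahit", "metaspades_hybrid"]  # hypothetical name

# Every unordered pair of assemblers, in list order.
PAIRS = list(combinations(ASSEMBLERS, 2))

# One target per pair, using the same naming scheme as the mmseq_compare output.
RBH_TARGETS = [
    f"{RESULTS_DIR}/annotation/mmseq2/{a1}__{a2}_rbh" for a1, a2 in PAIRS
]

for target in RBH_TARGETS:
    print(target)
```

In a Snakefile, RBH_TARGETS could then be given as the input of rule all in place of the expand call, so adding a fourth assembler only means extending one list. For logging, one option is a log path in mmseq_compare that contains both wildcards, so that each comparison writes its own file regardless of execution order.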