Snakemake

Question

In Snakemake, you can call an external script like this:

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "path/to/script.R"

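This gives the R script convenient access to an S4 object named snakemake. In my case, I am running Snakemake on a SLURM cluster, and I need to load R with module load R/3.6.0 before Rscript is executed; otherwise the job fails with: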

/usr/bin/bash: Rscript: command not found

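How can I tell Snakemake to do this? If I run the rule with shell instead of script, my R script unfortunately no longer has access to the snakemake object, so this is not an ideal solution: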

shell:
    "module load R/3.6.0;"
    "Rscript path/to/script.R"

Answer 1

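You cannot run shell commands from the script directive, so you do have to use the shell directive. You can always pass the inputs and outputs as command-line arguments: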

rule NAME:
    input:
        in1="path/to/inputfile",
        in2="path/to/other/inputfile"
    output:
        out1="path/to/outputfile",
        out2="path/to/another/outputfile"
    shell:
        """
        module load R/3.6.0
        Rscript path/to/script.R {input.in1} {input.in2} {output.out1} {output.out2}
        """

And retrieve the arguments in the R script:

args=commandArgs(trailingOnly=TRUE)
inFile1=args[1]
inFile2=args[2]
outFile1=args[3]
outFile2=args[4]

Using a conda environment:

You can specify a conda environment to be used for a particular rule:

rule NAME:
    input:
        in1="path/to/inputfile",
        in2="path/to/other/inputfile"
    output:
        out1="path/to/outputfile",
        out2="path/to/another/outputfile"
    conda: "r.yml"
    script:
        "path/to/script.R"

In your r.yml file:

name: rEnv
channels:
  - r
dependencies:
  - r-base=3.6

Then, when you run Snakemake:

snakemake .... --use-conda

Snakemake will create all the environments before running any jobs, and each environment will be activated inside the job that is submitted to SLURM.
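Because this variant keeps the script directive, script.R still receives the snakemake object. As a minimal sketch, the named inputs and outputs of the rule above could be read like this; the input and output slots are what Snakemake exposes on the S4 object, while the guard at the top and the message() calls are illustrative assumptions only:

```r
# script.R, executed by Snakemake through the `script:` directive.
# Snakemake injects an S4 object called `snakemake` before this code runs,
# so the script is not meant to be started by hand with Rscript.
if (!exists("snakemake")) {
    stop("This script expects to be run via Snakemake's script directive")
}

# Named entries from the rule are available by name on each slot.
inFile1  <- snakemake@input[["in1"]]
inFile2  <- snakemake@input[["in2"]]
outFile1 <- snakemake@output[["out1"]]
outFile2 <- snakemake@output[["out2"]]

message("Reading ", inFile1, " and ", inFile2)
message("Writing ", outFile1, " and ", outFile2)
# Do stuff with inFile1, inFile2, etc...
```

Unnamed entries, as in the rule from the question, can be reached by position instead, for example snakemake@input[[1]].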

Answer 2

If your concern is passing arguments by name in the Rscript command, you could do something like this (basically an extension of Eric's answer):

rule NAME:
    input:
        in1="path/to/inputfile",
        in2="path/to/other/inputfile"
    output:
        out1="path/to/outputfile",
        out2="path/to/another/outputfile"
    shell:
        r"""
        module load R/3.6.0
        Rscript path/to/script.R \
            inFile1={input.in1} inFile2={input.in2} \
            outFile1={output.out1} outFile2={output.out2}
        """

Then, in script.R, access each argument by parsing the command line:

args <- commandArgs(trailingOnly= TRUE)

for(x in args){
    if(grepl('^inFile1=', x)){
        inFile1 <- sub('^inFile1=', '', x)
    }
    else if(grepl('^inFile2=', x)){
        inFile2 <- sub('^inFile2=', '', x)
    }
    else if(grepl('^outFile1=', x)){
        outFile1 <- sub('^outFile1=', '', x)
    }
    else if(grepl('^outFile2=', x)){
        outFile2 <- sub('^outFile2=', '', x)
    }
    else {
        stop(sprintf('Unrecognized argument %s', x))
    }
}
# Do stuff with inFile1, inFile2, etc...

Also consider a library designed for parsing the command line; I am quite happy with argparse for R myself.
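As a rough sketch of what that could look like, the script below uses the argparse package for R in place of the manual loop. The flag names (--inFile1 and so on) are hypothetical and simply mirror the variable names used earlier; the rule's shell command would then have to call Rscript path/to/script.R --inFile1 {input.in1} --inFile2 {input.in2} --outFile1 {output.out1} --outFile2 {output.out2}.

```r
# script.R: parse named command-line flags with the argparse package.
# Assumes argparse is installed (install.packages("argparse")) and that a
# Python interpreter is on the PATH, since the package delegates to Python.
suppressPackageStartupMessages(library(argparse))

parser <- ArgumentParser(description = "Example script called from a Snakemake shell rule")

# Flag names are illustrative; they mirror the variables used earlier.
parser$add_argument("--inFile1", required = TRUE, help = "first input file")
parser$add_argument("--inFile2", required = TRUE, help = "second input file")
parser$add_argument("--outFile1", required = TRUE, help = "first output file")
parser$add_argument("--outFile2", required = TRUE, help = "second output file")

# Missing or unrecognized flags make parse_args() exit with a usage message.
args <- parser$parse_args()

inFile1  <- args$inFile1
inFile2  <- args$inFile2
outFile1 <- args$outFile1
outFile2 <- args$outFile2
# Do stuff with inFile1, inFile2, etc...
```

Compared with the hand-written loop, this gives a --help message and error reporting for free, at the cost of an extra dependency in the job environment.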

Answer 3

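Maybe you are looking for envmodules, the Snakemake directive that loads cluster modules, just as module load does. It only takes effect when Snakemake is run with --use-envmodules, and it should also apply to rules that use script instead of shell, so the snakemake object would stay available: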

rule your_rule:
    input:
        "path/to/inputfile"
    output:
        "path/to/outputfile"
    envmodules:
        "R/3.6.0"
    shell:
        "Rscript path/to/script.R"

Snakemake - load cluster modules before an external script is called
