接受对 snakemake 规则略有不同的输入(.fq 与 .fq.gz)

Accepting slightly different inputs to snakemake rule (.fq vs .fq.gz)

我是 snakemake 的新手,希望能够获取一对 .fq 文件或一对 .fq.gz 文件,然后 运行 它们通过 trim_galore 得到一对修剪过的 .fq.gz 输出文件。没有给出我所有的 Snakefile,我有下面丑陋的解决方案,我只是复制了规则并更改了输入。什么是更好的解决方案?

#Trim galore paired end trimming rule for unzipped fastqs:
rule trim_galore_unzipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq'),
        r2=join(config['fq_in_path'], '{sample}2.fq'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

#Trim galore paired end trimming rule for gzipped fastqs:
rule trim_galore_zipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq.gz'),
        r2=join(config['fq_in_path'], '{sample}2.fq.gz'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell: 
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

使用输入函数可能是最好的解决方案,如下所示:

  1. 将通配符传递给输入函数
  2. 使用已知的 YAML 值,使用该示例名称构建理论文件名。
  3. 使用python函数检查哪个文件(技术上的文件后缀)有效
  4. 构建有效文件列表
  5. Return 并解压有效文件列表。

备注:

  • 输入和输出应该有相同的通配符,否则会导致问题
  • 在输入函数中,确保它不能 return 空字符串,因为 Snakemake 将此解释为 "lack of input" 要求,这不是您想要的。
  • 如果采纳这些建议,更新规则名称,我忘记了。

Snakefile:

 configfile: "config.yaml"

 from os.path import join
 from os.path import exists

 rule all:
     input:
         expand("{trim_out_path}/{sample}.{readDirection}.fq.gz",
             trim_out_path=config["trim_out_path"],
             sample=config["sampleList"],
             readDirection=['1','2'])


 def trim_galore_input_determination(wildcards):
     potential_file_path_list = []
     # Cycle through both suffix possibilities:
     for fastqSuffix in [".fq", ".fq.gz"]:

         # Cycle through both read directions
         for readDirection in ['.1','.2']:

             #Build the list for ech suffix
             potential_file_path = config["fq_in_path"] + "/" + wildcards.sample + readDirection + fastqSuffix

             #Check if this file actually exists
             if exists(potential_file_path):

                 #If file is legit, add to list of acceptable files
                 potential_file_path_list.append(potential_file_path)

     # Checking for an empty list
     if len(potential_file_path_list):
         return potential_file_path_list
     else:
         return ["trim_galore_input_determination_FAILURE" + wildcards.sample]

 rule trim_galore_unzipped_PE:
     input:
         unpack(trim_galore_input_determination)
     output:
         expand("{trim_out_path}/{{sample}}.{readDirection}.fq.gz",
             trim_out_path=config["trim_out_path"],
             readDirection=['1','2'])
     params:
         out_path=config['trim_out_path'],
     conda:
         'envs/biotools.yaml',
     shell:
         'trim_galore --gzip -o {params.out_path} --paired {input}'

config.yaml:

fq_in_path: input/fq
trim_out_path: output
sampleList: ["mySample1", "mySample2"]

$树:

|-- [tboyarsk      1540 Sep  6 15:17]  Snakefile
|-- [tboyarsk        82 Sep  6 15:17]  config.yaml
|-- [tboyarsk       512 Sep  6  8:55]  input
|   |-- [tboyarsk       512 Sep  6  8:33]  fq
|   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq
|   |   |-- [tboyarsk         0 Sep  6  8:24]  mySample1.2.fq
|   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample2.1.fq
|   |   `-- [tboyarsk         0 Sep  6  8:24]  mySample2.2.fq
|   `-- [tboyarsk       512 Sep  6  8:55]  fqgz
|       |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq.gz
|       |-- [tboyarsk         0 Sep  6  8:32]  mySample1.2.fq.gz
|       |-- [tboyarsk         0 Sep  6  8:33]  mySample2.1.fq.gz
|       `-- [tboyarsk         0 Sep  6  8:32]  mySample2.2.fq.gz
`-- [tboyarsk       512 Sep  6  7:55]  output

$snakemake -dry (输入: fg)

 rule trim_galore_unzipped_PE:
     input: input/fq/mySample1.1.fq, input/fq/mySample1.2.fq
     output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
     jobid: 1
     wildcards: sample=mySample1


 rule trim_galore_unzipped_PE:
     input: input/fq/mySample2.1.fq, input/fq/mySample2.2.fq
     output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
     jobid: 2
     wildcards: sample=mySample2


 localrule all:
     input: output/mySample1.1.fq.gz, output/mySample2.1.fq.gz, output/mySample1.2.fq.gz, output/   mySample2.2.fq.gz
     jobid: 0

 Job counts:
         count   jobs
         1       all
         2       trim_galore_unzipped_PE
         3

$snakemake -dry (输入: fgqz)

 rule trim_galore_unzipped_PE:
     input: input/fqgz/mySample1.1.fq.gz, input/fqgz/mySample1.2.fq.gz
     output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
     jobid: 1
     wildcards: sample=mySample1


 rule trim_galore_unzipped_PE:
     input: input/fqgz/mySample2.1.fq.gz, input/fqgz/mySample2.2.fq.gz
     output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
     jobid: 2
     wildcards: sample=mySample2


 localrule all:
     input: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.1.fq.gz, output/   mySample2.2.fq.gz
     jobid: 0

 Job counts:
         count   jobs
         1       all
         2       trim_galore_unzipped_PE
         3

有一些方法可以让它更通用,但由于您声明并使用 YAML 配置来构建大部分文件名,我将避免在答案中讨论它。只是说这是可能的并且有点鼓励。

“--paired {input}”将展开以提供两个文件。因为 for 循环,1 总是在 2 之前。