How to download a list of `FastQ` files in `Nextflow` using the `fromSRA` function?

Question

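I have a TSV file with multiple columns. The run_accession column is one of the columns I am interested in. It contains the accession IDs of various genomic data samples. I want to write a pipeline in Nextflow that reads the accession IDs from this file using the following command: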

cut -f4 datalist.tsv | sed -n 2,11p

Output:

ERR2512385  
ERR2512386  
ERR2512387  
ERR2512388  
ERR2512389  
ERR2512390  
ERR2512391  
ERR2512392  
ERR2512393  
ERR2512394

and feeds this list of IDs to the Channel.fromSRA method. So far, I have tried this:

#!/home/someuser/bin nextflow

nextflow.enable.dsl=2

params.datalist = "$baseDir/datalist.tsv"

process fetchRunAccession {
    input:
    path dlist

    output:
    file accessions

    """
    cut -f4 $dlist | sed -n 2,11p
    """
}

process displayResult {
    input:
    file accessions

    output:
    stdout

    """
    echo "$accessions"
    """
}

workflow {
    accessions_p = fetchRunAccession(params.datalist)
    result = displayResult(accessions_p)
    result.view { it }
}

I get this error:

Error executing process > 'fetchRunAccession'

Caused by:
  Missing output file(s) `accessions` expected by process `fetchRunAccession

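If I run just the first process, it works fine and prints the 10 lines as expected. The second process is only a placeholder for the actual fromSRA implementation, but I cannot use the output of the first process as the input to the second. I am new to Nextflow, so my code probably has some silly mistakes. I would appreciate any help with this.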

Answer 1

The fromSRA function is actually a channel factory method. It requires either a project/study ID, or one or more accession numbers, which must be specified as a list. A channel emitting accession numbers (like in your example code) will not work here. Also, it would be better to avoid spawning a separate job/process just to parse a small CSV or TSV file. Instead, just let your main Nextflow process do this. There are lots of ways to do this, but for CSV-style input (here tab-separated) I find that Nextflow's CsvSplitter class makes this easy:

import nextflow.splitter.CsvSplitter

nextflow.enable.dsl=2


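// Read the TSV in the main Nextflow script and return the values of its
// run_accession column as a plain list, which is what fromSRA expects.
def fetchRunAccessions( tsv ) {

    // header:true keys each row by column name; sep:'\t' reads tab-separated input
    def splitter = new CsvSplitter().options( header:true, sep:'\t' )
    def reader = new BufferedReader( new FileReader( tsv ) )

    // Consume the header line so that records can be looked up by column name
    splitter.parseHeader( reader )

    List<String> run_accessions = []
    Map<String,String> row

    // The loop ends when fetchRecord returns a falsy value at end of input
    while( row = splitter.fetchRecord( reader ) ) {

       run_accessions.add( row['run_accession'] )
    }

    reader.close()

    return run_accessions
}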


workflow {

    accessions = fetchRunAccessions( params.filereport )

    Channel
        .fromSRA( accessions )
        .view()
}
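
If you want to check what that function will pick up before launching the workflow, the same lookup by header name can be sketched in the shell. This is only a sanity check under assumptions: the file names `demo_filereport.tsv` and `demo_accessions.txt`, the first three column names, and the study/sample/experiment values are made up for illustration; only `run_accession` and the two ERR IDs come from the question.

```shell
# Made-up stand-in for the real report: a header row plus two data rows.
# Only the run_accession column name and the ERR IDs are from the question.
printf 'study_accession\tsample_accession\texperiment_accession\trun_accession\n' > demo_filereport.tsv
printf 'PRJEB0000\tSAMEA0001\tERX0001\tERR2512385\n' >> demo_filereport.tsv
printf 'PRJEB0000\tSAMEA0002\tERX0002\tERR2512386\n' >> demo_filereport.tsv

# Locate the run_accession column by name in the header row, then print
# that column for every following row (one accession per line).
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "run_accession") col = i; next }
    col     { print $col }
' demo_filereport.tsv > demo_accessions.txt

cat demo_accessions.txt
```

Unlike `cut -f4`, this does not depend on the column being the fourth one, which mirrors the `row['run_accession']` lookup above.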

Note that Nextflow's ENA download URL was updated recently. You will need the latest version of Nextflow (21.07.0-edge at the time of writing) to get this to run easily:

NXF_VER=21.07.0-edge nextflow run test.nf --filereport filereport.tsv
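
As for the original error: the script in `fetchRunAccession` only prints to stdout, so no file named `accessions` is ever created in the task's work directory, and that file is what the `output:` declaration asks for. A minimal sketch of the difference, run outside Nextflow on a made-up `demo_datalist.tsv` (the file name and its contents are placeholders, not your data):

```shell
# Made-up stand-in for datalist.tsv: a header plus 12 rows, with the
# accession in column 4 as in the question.
printf 'c1\tc2\tc3\trun_accession\n' > demo_datalist.tsv
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    printf 'a\tb\tc\tRUN%s\n' "$i" >> demo_datalist.tsv
done

# Same command as in the question: the IDs go to stdout only (discarded
# here to keep the demo quiet), so this step creates no file.
cut -f4 demo_datalist.tsv | sed -n 2,11p > /dev/null
[ -e accessions ] || echo "no file named accessions yet"

# Redirecting the output creates the file that the output declaration expects.
cut -f4 demo_datalist.tsv | sed -n 2,11p > accessions
wc -l < accessions   # 10 lines, matching sed -n 2,11p
```

Inside the process, that would mean ending the script line with `> accessions`. It would make the error go away, but the process would then emit a file through a channel, which still cannot be handed to fromSRA, so parsing the file in the main script as shown above remains the simpler route.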

bioinformatics

nextflow