Extract sample IDs from Nextflow fromPath()

Question

I'm new to Nextflow, and this is an approach I'd like to try out in my actual work.

#!/usr/bin/env nextflow

params.cns = '/data1/deliver/phase2/CNVkit/*.cns'
cns_ch = Channel.fromPath(params.cns)
cns_ch.view()

This script outputs:

N E X T F L O W  ~  version 21.04.0
Launching `cnvkit_call.nf` [festering_wescoff] - revision: 886ab3cf13
/data1/deliver/phase2/CNVkit/002-002_L4_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/015-002_L4.SSHT89_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/004-005_L1_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/018-008_L1.SSHT31_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/003-002_L3_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/002-004_L6_sorted_dedup.cns

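Here 002-002, 015-002, 004-005, etc. are the sample IDs. I'm trying to write a simple process that outputs files named like ${sample.id}_sorted_dedup.calls.cns, but I'm not sure how to extract these IDs and use them in the output file name.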

process cnvcalls {
    input:
    file(cns_file) from cns_ch

    output:
    file("${sample.id}_sorted_dedup.calls.cns") into cnscalls_ch

    script:
    """
    cnvkit.py call ${cns_file} -o ${sample.id}_sorted_dedup.calls.cns
    """
}

How can I modify process cnvcalls so that it works with sample.id?

Answer 1

There are many ways to extract sample names/IDs from file names. One is to split on the underscore and take the first element:

params.cns = '/data1/deliver/phase2/CNVkit/*.cns'
cns_ch = Channel.fromPath(params.cns)


process cnvcalls {

    input:
    path(cns_file) from cns_ch

    output:
    path("${sample_id}_sorted_dedup.calls.cns") into cnscalls_ch

    script:
    sample_id = cns_file.name.split('_')[0]

    """
    cnvkit.py call "${cns_file}" -o "${sample_id}_sorted_dedup.calls.cns"
    """
}
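
As a quick check of the split rule, here is a minimal plain-Groovy sketch that can be run outside Nextflow. It uses three of the file names from the question and takes the base name through java.io.File, which is only a stand-in for the path object Nextflow passes to the process:

```groovy
// Paths copied from the output shown in the question.
def paths = [
    '/data1/deliver/phase2/CNVkit/002-002_L4_sorted_dedup.cns',
    '/data1/deliver/phase2/CNVkit/015-002_L4.SSHT89_sorted_dedup.cns',
    '/data1/deliver/phase2/CNVkit/004-005_L1_sorted_dedup.cns',
]

// Same rule as in the process: keep everything before the first underscore
// of the base name (File is used here only to strip the directory part).
def sampleIds = paths.collect { p -> new File(p).name.split('_')[0] }

assert sampleIds == ['002-002', '015-002', '004-005']
```

This only works as long as the sample ID itself never contains an underscore; if it can, a regular expression anchored on the known suffix would be a safer rule.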

However, I prefer to use a tuple to pass the sample name/ID in alongside the input file:

params.cns = '/data1/deliver/phase2/CNVkit/*.cns'
cns_ch = Channel.fromPath(params.cns).map {
    tuple( it.name.split('_')[0], it )
}


process cnvcalls {

    input:
    tuple val(sample_id), path(cns_file) from cns_ch

    output:
    path "${sample_id}_sorted_dedup.calls.cns" into cnscalls_ch

    """
    cnvkit.py call "${cns_file}" -o "${sample_id}_sorted_dedup.calls.cns"
    """
}
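
The from/into declarations used above are DSL1 syntax, which later Nextflow releases no longer accept. As a rough sketch only, the same tuple approach might look like this in DSL2, assuming the same glob and the same cnvkit.py command; the explicit workflow block and the closure parameter name f are additions for illustration and are not part of the answer above:

```groovy
// Enables DSL2 on 21.04, where DSL1 is still the default.
nextflow.enable.dsl = 2

params.cns = '/data1/deliver/phase2/CNVkit/*.cns'

process cnvcalls {

    input:
    tuple val(sample_id), path(cns_file)

    output:
    // Carry the sample ID along so downstream steps can still use it.
    tuple val(sample_id), path("${sample_id}_sorted_dedup.calls.cns")

    script:
    """
    cnvkit.py call "${cns_file}" -o "${sample_id}_sorted_dedup.calls.cns"
    """
}

workflow {
    // Pair each file with the part of its name before the first underscore.
    cns_ch = Channel.fromPath(params.cns).map { f ->
        tuple(f.name.split('_')[0], f)
    }

    cnvcalls(cns_ch)
    cnvcalls.out.view()
}
```

In DSL2 the channel wiring moves out of the process and into the workflow block, so the process itself only declares the shape of its inputs and outputs.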

nextflow