Snakefile how to mix wildcards and variables

Question

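I want to write a rule that, for a given number of threads, converts files in one directory and format to another directory and format in parallel. Some elements of the path are defined by variables and some are wildcards. I want it to use wildcards for phase, sample and ext, but take stage, challenge and language from Python variables. I want each job to map one file to one file. I do not want it to receive the whole file list as input. I am not using expand here because, if I do, snakemake passes the entire input list as {input} and the entire output list as {output} to the function, which is not what I want. Here is the Snakefile: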

from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

stage = "/media/catskills/interspeech22"
challenge = "openasr21"
language = "farsi"
sample_rate = 16000

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule sph_to_wav:
     input:
         "{stage}/{challenge}_{language}/{phase}/audio/{sample}.{ext}"
     output:
         "{stage}/{challenge}_{language}/{phase}/wav_{sample_rate}/{sample}.wav"
     run:
         copy_sph_to_wav({input}, {output}, sample_rate)

When I run it, I get this error:

$ snakemake -c16 
Building DAG of jobs...
WildcardError in line 11 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'stage'

Is there a way to do this in snakemake?

Update: I found a partial solution here, which is to use f-strings with doubled curly braces.

from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule sph_to_wav:
     input:
         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/audio/{{sample}}.{{ext}}"
     output:
         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/wav_{{SAMPLE_RATE}}/{{sample}}.wav"
     run:
         copy_sph_to_wav({input}, {output}, SAMPLE_RATE)

But the wildcards do not match the subdirectory names. I still get an error, though a slightly different one:

$ snakemake -c16 
Building DAG of jobs...
WildcardError in line 11 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'phase'
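A plain-Python check makes the brace handling in the f-string attempt easier to see. In an f-string, single braces are substituted by Python straight away, while doubled braces collapse to literal single braces, which Snakemake then reads as wildcards. The sketch below reuses the variable names from the Snakefile above and only illustrates Python string formatting, not Snakemake itself.

```python
STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Single braces are filled in by Python; doubled braces survive as
# literal "{...}" and are left for Snakemake to interpret as wildcards.
input_pattern = f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/audio/{{sample}}.{{ext}}"

# Doubling the braces around SAMPLE_RATE turns it into a wildcard name...
as_wildcard = f"{{phase}}/wav_{{SAMPLE_RATE}}/{{sample}}.wav"
# ...whereas single braces insert the value 16000.
as_value = f"{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav"

print(input_pattern)
print(as_wildcard)
print(as_value)
```

The same holds for a plain (non-f) string: every {name} in it is left untouched by Python and therefore reaches Snakemake as a wildcard.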

This leads to:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.{ext}" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/wav_{SAMPLE_RATE}/{sample}.wav" )
     run:
         copy_sph_to_wav({input}, {output}, SAMPLE_RATE)

But I am not done yet:

$ snakemake -c16 
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).

I added a rule all to try to correct this:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

sph_results = [x.replace('.sph', '.wav').replace('/audio/', f'/wav_{SAMPLE_RATE}/')
               for x in glob(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/*/audio/*")]

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule all:
     input:
        sph_results

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.{ext}" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/wav_{SAMPLE_RATE}/{sample}.sph" )
     run:
         copy_sph_to_wav({input}, {output}, SAMPLE_RATE)

Finally, it complains that the files to be built do not exist yet, before it even tries to call the function that would build them:

$ snakemake -c16 
Building DAG of jobs...
MissingInputException in line 15 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Missing input files for rule all:
/media/catskills/interspeech22/openasr21_farsi/dev/wav_16000/MATERIAL_OP2-3S-BUILD_46645_20171106_064534_inLine.wav
/media/catskills/interspeech22/openasr21_farsi/eval/wav_16000/MATERIAL_OP2-3S_77793199_outLine.wav
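The targets listed under rule all have to be exactly the paths that some other rule declares as its output. In the Snakefile above the list ends in .wav while sph_to_wav declares an output ending in .sph, which is probably why Snakemake finds no rule for these files and reports them as missing inputs. A minimal sketch of deriving the target list from the source paths follows; the helper name wav_target and the two sample paths are hypothetical stand-ins for the result of the glob call.

```python
from pathlib import PurePosixPath

def wav_target(src: str, sample_rate: int) -> str:
    """Map .../<phase>/audio/<sample>.<ext> to .../<phase>/wav_<rate>/<sample>.wav."""
    p = PurePosixPath(src)
    phase_dir = p.parent.parent  # drop "audio/<file>"
    return str(phase_dir / f"wav_{sample_rate}" / f"{p.stem}.wav")

# Hypothetical source paths, standing in for the result of glob(...)
sources = [
    "/media/catskills/interspeech22/openasr21_farsi/dev/audio/a_inLine.sph",
    "/media/catskills/interspeech22/openasr21_farsi/eval/audio/b_outLine.wav",
]

# A set removes duplicates in case a sample exists as both .sph and .wav.
targets = sorted({wav_target(s, 16000) for s in sources})
print(targets)
```

Unlike chained str.replace calls, this handles .sph and .wav sources the same way and only touches the final two path components.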

For completeness, the function copy_sph_to_wav is:

import os
import librosa
import soundfile as sf

def copy_sph_to_wav(src, dst, sr):
    cmd='/home/catskills/is22/sph2pipe_v2.5/sph2pipe'
    if src[-4:]=='.wav':
        audio,sr1=librosa.load(src, sr=sr)
    else:
        os.system(f"{cmd} -f wav {src} {dst}")
        audio,sr1=librosa.load(dst, sr=sr)

    sf.write(dst, audio, sr)

Update 2: This brings us here, where we have fixed some problems with the sph_to_wav rule generating outputs that did not match our OUTPUTS files:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

OUTPUTS = [x.replace('.sph', '.wav').replace('/audio/', f'/wav_{SAMPLE_RATE}/')
           for x in glob(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/*/audio/*")]

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule all:
     input:
         expand("{output}", output=OUTPUTS)

rule sph_to_wav:
     input:
         '/media/catskills/interspeech22/openasr21_farsi/{phase}/audio/{sample}.{ext}'
     output:
         '/media/catskills/interspeech22/openasr21_farsi/{phase}/wav_16000/{sample}.wav'
     run:
         copy_sph_to_wav({input}, {output}, SAMPLE_RATE)

However, we still get an error, though a more focused one:

$ snakemake -c16 -p -n  
Building DAG of jobs...
WildcardError in line 19 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ext'

There may be a clue related to wildcard_constraints.

Update 3: This answer says

Each wildcard in the input section shall have a corresponding wildcard (with the same name) in the output section. That is how Snakemake works: when the Snakemake tries to construct the DAG of jobs and finds that it needs a certain file, it looks at the output section for each rule and checks if this rule can produce the required file. This is the way how Snakemake assigns certain values to the wildcard in the output section. Every wildcard in other sections shall match one of the wildcards in the output, and that is how the input gets concrete filenames.

If that is true, then I think there is no snakemake solution, because I am trying to replace .sph with .wav and I do not want to produce .sph.wav files.
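The mechanism described in the quoted answer can be illustrated with a simplified model in plain Python. This is only a sketch: the function match_output is hypothetical, it assumes each wildcard matches the regular expression .+ and that no wildcard name repeats within a pattern, and it is not Snakemake's actual implementation.

```python
import re

def match_output(pattern: str, target: str):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])   # literal text between wildcards
        regex += f"(?P<{m.group(1)}>.+)"             # one named group per wildcard
        pos = m.end()
    regex += re.escape(pattern[pos:])
    hit = re.fullmatch(regex, target)
    return hit.groupdict() if hit else None

output_pattern = "{phase}/wav_16000/{sample}.wav"
input_pattern = "{phase}/audio/{sample}.{ext}"

# A requested file is matched against the output pattern to obtain values.
values = match_output(output_pattern, "dev/wav_16000/rec01.wav")
print(values)

# Wildcards used by the input but absent from the output cannot be filled in.
needed = set(re.findall(r"\{(\w+)\}", input_pattern))
print(needed - set(values))
```

In this model the only source of wildcard values is the requested output path, so a name such as ext that appears only in the input has nothing to take its value from.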

Answer 1

Try this:

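rule all:
    input:
        # expand() takes keyword arguments; replacements is a list of path stems
        expand("{your_path}.extension", your_path=replacements)

rule make_output:
    input: "{input}_{num}.extension"
    output: "{output}_{num}.extension"
    shell:
        # the shell command must be given as a string
        "copy_sph_to_wav {input} > {output}"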

Answer 2

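I finally figured it out. I am not 100% happy with it (I would rather have .{ext} in the input but not in the output), but it works, and I suppose it makes its own kind of sense. The issue is that my input directories can contain either .sph or .wav files, depending on the whims of the data provider, so I have to be prepared for both contingencies: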

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav
from copy_wav_to_wav import copy_wav_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

OUTPUTS = [x.replace('.sph', '.wav').replace('/audio/', f'/wav_{SAMPLE_RATE}/')
           for x in glob(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/*/audio/*")]

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule all:
     input:
         expand("{output}", output=OUTPUTS)

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.sph" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / f"{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav" )
     run:
         copy_sph_to_wav(list({input})[0][0], list({output})[0][0], SAMPLE_RATE)

rule wav_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.wav" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / f"{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav" )
     run:
         copy_wav_to_wav(list({input})[0][0], list({output})[0][0], SAMPLE_RATE)

I also found, by printing the arguments inside my function, that {input} is a set containing a list of one element:

SRC {['/media/catskills/interspeech22/openasr21_farsi/build/audio/MATERIAL_OP2-3S-BUILD_29884_20170907_021506_outLine.sph']}
DST {['/media/catskills/interspeech22/openasr21_farsi/build/wav_16000/MATERIAL_OP2-3S-BUILD_29884_20170907_021506_outLine.wav']}
SR 16000

I did not even know that was possible, so I have to use the ugly conversion list({input})[0], and I do not know why.
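A likely explanation for the printed SRC and DST values: the body of a run: block is ordinary Python, so {input} there is a set literal wrapping the list-like input object, not a placeholder. The {input} substitution syntax applies to strings such as those given to shell:. The sketch below reproduces the effect in plain Python; the class FileList is a hypothetical stand-in for Snakemake's input object, and the assumption that the real object is hashable is inferred from the output shown above.

```python
class FileList(list):
    """Stand-in for Snakemake's list-like input object (hypothetical name)."""
    def __hash__(self):
        # Plain lists are unhashable; this makes the stand-in usable in a set.
        return hash(tuple(self))

input_files = FileList(["build/audio/rec01.sph"])

wrapped = {input_files}        # set literal: a set holding the whole list
print(len(wrapped))            # one element, the list itself
print(list(wrapped)[0][0])     # the workaround used in the rules above
print(input_files[0])          # the same path, without the detour
```

Under that reading, the run: lines could be written as copy_sph_to_wav(input[0], output[0], SAMPLE_RATE), since input and output are available as Python variables inside run:.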

Anyway, the consummation devoutly to be wished is finally achieved with snakemake -c16.
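On the wish to keep the extension out of the output: Snakemake accepts a function as a rule's input, which is called with the wildcards object and returns the concrete path. That would allow a single rule to pick whichever source file exists. The following is a sketch under assumptions, not a tested workflow: make_source_finder is a hypothetical helper, and the check at the bottom uses a temporary directory and a stand-in for the wildcards object.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

def make_source_finder(base):
    """Build an input function that looks under `base` (hypothetical helper)."""
    def find_source(wildcards):
        audio_dir = Path(base) / wildcards.phase / "audio"
        for ext in ("sph", "wav"):
            candidate = audio_dir / f"{wildcards.sample}.{ext}"
            if candidate.exists():
                return str(candidate)
        # Nothing found: return the .sph path so the missing file gets reported.
        return str(audio_dir / f"{wildcards.sample}.sph")
    return find_source

# Quick check outside Snakemake, with a stand-in for the wildcards object.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dev" / "audio").mkdir(parents=True)
    (Path(tmp) / "dev" / "audio" / "rec01.wav").touch()
    finder = make_source_finder(tmp)
    found = finder(SimpleNamespace(phase="dev", sample="rec01"))
    missing = finder(SimpleNamespace(phase="dev", sample="rec02"))

print(Path(found).name)
print(Path(missing).name)

# Possible use in the Snakefile (untested sketch):
#
# rule to_wav:
#     input:
#         make_source_finder(f"{STAGE}/{CHALLENGE}_{LANGUAGE}")
#     output:
#         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav"
#     run:
#         copy_sph_to_wav(input[0], output[0], SAMPLE_RATE)
```

The copy_sph_to_wav function shown in the question already branches on a .wav source, so one rule of this shape might replace both sph_to_wav and wav_to_wav.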
