Snakemake expand 正在生成两个通配符的意外组合

Question

借助之前的 Whosebug responses，我正在使用示例数据框读取文件信息，包括示例和批处理序列文件列表。在我的 rule all 中，我使用 expand 和 zip 来创建目标文件列表；但是，我遇到了一个错误，我在输出中得到了样本和批处理通配符的意外组合。我想知道您是否有关于如何将输出限制为所有规则定义的输出的建议（即这个问题与我如何定义 trim_reads 规则的输出有关吗？）。

例如，带有样本和批次通配符的样本文件部分：

sample,fastq_1,fastq_2,batch,run
L5011,/labs/jandr/walter/tb/data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz,/labs/jandr/walter/tb/data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz,MT02_MTB_2021-10-29,MT02_MTB_2021-10-29

当我运行 snakemake dry 运行 (snakemake -np) 时，我得到以下样本和批次意外组合的示例（指定的批次不正确）：

rule trim_reads:
    input: data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz, data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz
    output: results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_1.fq.gz, results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_2.fq.gz
    log: results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_reads.log
    jobid: 6137
    wildcards: batch=MT01_MtB_Baits-2021-09-17, sample=L5011

mtb/workflow/scripts/trim_reads.sh data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_1.fq.gz results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_2.fq.gz &>> results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_reads.log

非常感谢您的帮助！

samples_df = pd.read_table('config/MT01-04.tsv',sep = ',').set_index("sample", drop=False)
sample_names = list(samples_df['sample'])
batch_names = list(samples_df['batch'])
#print(sample_names)

# fastq1 input function definition
def fq1_from_sample(wildcards):
  return samples_df.loc[wildcards.sample, "fastq_1"]

# fastq2 input function definition
def fq2_from_sample(wildcards):
  return samples_df.loc[wildcards.sample, "fastq_2"]

# Define a rule for running the complete pipeline. 
rule all:
  input:
    trim = expand(['results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz'], zip, samp=sample_names,batch=batch_names)

# Trim reads for quality. 
rule trim_reads:  
  input: 
    p1=fq1_from_sample,
    p2=fq2_from_sample
  output:     
    trim1=temp('results/{batch}/{sample}/trim/{sample}_trim_1.fq.gz'),
    trim2=temp('results/{batch}/{sample}/trim/{sample}_trim_2.fq.gz')
  log: 
    'results/{batch}/{sample}/trim/{sample}_trim_reads.log'
  shell:
    '{config[scripts_dir]}trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'

Answer 1

您提供的示例看起来不错。这是一个显示它工作的最小示例：

# Snakefile
samples = ['1', '2']
batches = ['b1', 'b2']

print(expand(['{sample}/{batch}/{sample}'], zip, sample=samples, batch=batches))

$ snakemake -nq
['1/b1/1', '2/b2/2']

尝试以下操作：

全部更改你的规则，这样你就可以看到扩展输出是什么

trim = expand(['results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz'], zip, samp=sample_names,batch=batch_names)
print(trim)
rule all:
  input: trim

检查其他规则的要求。如果您比发布的内容多运行，请尝试 snakemake --debug-dag 查看哪些规则正在请求哪些输出。 snakemake --reason 也可以方便地检查个别规则。

希望对您有所帮助，如果您仍然卡住，请更新！

Answer 2

感谢@Troy Comi 提供的故障排除帮助。使用 snakemake --debug-dag 进行故障排除后，我发现 follow-up 规则没有在扩展函数中正确使用 'zip'，并且在 trim_reads 规则的下游生成了意外的目标文件.

再次感谢您帮助解决此问题！

Snakemake expand 正在生成两个通配符的意外组合

Snakemake expand is generating unexpected combinations of two wildcards

python

wildcard

snakemake