Using snakemake to rename files according to defined mapping

Question

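I am trying to use snakemake to download a list of files and then rename them according to a mapping given in a file. I first read a dictionary of the form {ID_for_download : sample_name} from a file, and then pass its list of keys to the first rule, which does the downloading (because downloading is expensive, I am just using a dummy script that generates empty files). For every entry in the list, two files are downloaded, of the form {file_1.fastq} and {file_2.fastq}. Once these files are downloaded, I rename them with a second rule, where I take advantage of being able to run Python code inside a rule using the run keyword. When I do a dry run with the -n flag, everything works. But when I do an actual run, I get an error of the form: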

Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]

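What happens is that a directory for storing my files is created, my files are "downloaded", and then they are renamed. Then, when it reaches the last file, I get this error and everything is deleted. The snakemake file is as follows: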

import csv
import os
SRA_MAPPING = read_dictionary() #dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:] #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names
rule all:
    input:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),
rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "bash dummy_download.sh"
rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
    run:
        os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir():
                old_name=file[:file.find("_")]
                sample_name=SRA_MAPPING[old_name]
                new_name=file.replace(old_name,sample_name)
                os.rename(file,new_name)

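I have tried running download_srafiles on its own and it worked. I have also tried running rename_srafiles_to_samples on its own and it worked. But when I run the two together, I get the error. For completeness, the script dummy_download.sh is as follows: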

#!/bin/bash
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[@]}"
do
touch raw_samples/${file}_1.fastq
touch raw_samples/${file}_2.fastq
done

(linker.csv is a file with ID_for_download in one column and sample_name in the other)
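The Snakefile calls read_dictionary() without showing it. A minimal sketch of what such a helper might look like follows, assuming linker.csv is comma-separated with a single header row (which is what the `tail -n +2` in dummy_download.sh suggests). The default path and the column order are assumptions, not something the post confirms.

```python
import csv

def read_dictionary(path="linker.csv"):
    """Return {ID_for_download: sample_name} from a two-column CSV with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header line (assumed to be present)
        # Ignore blank or incomplete rows instead of failing on them
        return {row[0]: row[1] for row in reader if len(row) >= 2}
```

With the header skipped inside the helper, the `[1:]` slice on the key list would no longer be needed.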

What am I doing wrong?

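EDIT: As per user dariober, changing directory with Python's os module inside the rule rename_srafiles_to_samples "confuses" snakemake. Snakemake's logic is sound: if I change directory into raw_samples, it then looks for raw_samples inside raw_samples and fails to find it. To check this, I tested several versions.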
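The effect can be reproduced without snakemake. The sketch below (the function name and the file name A_1.fastq are made up for illustration) creates a file under a relative path, changes into raw_samples, and checks the same relative path again. It only illustrates how relative paths resolve after os.chdir; it is not a statement about snakemake's internals.

```python
import os
import tempfile

def output_visible_after_chdir(relative_output="raw_samples/A_1.fastq"):
    """Return (seen_before, seen_after): does the relative path exist before/after os.chdir?"""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        try:
            os.chdir(workdir)
            os.makedirs("raw_samples")
            open(relative_output, "w").close()  # create an empty "output" file
            seen_before = os.path.exists(relative_output)
            os.chdir("raw_samples")
            # The same string now points at raw_samples/raw_samples/A_1.fastq
            seen_after = os.path.exists(relative_output)
        finally:
            os.chdir(start)  # always return to where we started
    return seen_before, seen_after

print(output_visible_after_chdir())
```

The file never moves, yet the second check fails because the relative path is resolved against the new working directory.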

Version 1

As explained by dariober. The important bit of code:

for file in os.listdir('raw_samples'):
     old_name= file[:file.find("_")]
     sample_name=SRA_MAPPING[old_name]
     new_name= file.replace(old_name,sample_name)
     os.rename('raw_samples/' + file, 'raw_samples/' + new_name)

This lists the files in the raw_samples directory and then renames them. The crucial part is adding the directory prefix (raw_samples/) to each rename.
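A possible variation on Version 1 is to drive the loop from the mapping instead of from os.listdir, so that stray or already renamed files in the directory do not cause a KeyError. This is only a sketch: the function name rename_by_mapping and the suffixes parameter are hypothetical, and it assumes every ID in the mapping has both fastq files present.

```python
import os

def rename_by_mapping(directory, mapping, suffixes=("_1.fastq", "_2.fastq")):
    """Rename <id><suffix> to <sample><suffix> inside directory without changing cwd."""
    renamed = []
    for old_id, sample in mapping.items():
        for suffix in suffixes:
            src = os.path.join(directory, old_id + suffix)
            dst = os.path.join(directory, sample + suffix)
            os.rename(src, dst)  # raises FileNotFoundError if src is missing
            renamed.append(dst)
    return renamed
```

Inside the run block this could be called as rename_by_mapping("raw_samples", SRA_MAPPING), keeping every path relative to the directory snakemake was started in.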

Version 2

Same as my original post, but instead of staying in raw_samples, I change back to the working directory after the loop. This works.

os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir():
     old_name= file[:file.find("_")]
     sample_name=SRA_MAPPING[old_name]
     new_name= file.replace(old_name,sample_name)
     os.rename(file,new_name)
os.chdir("..")
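One weakness of Version 2 is that the final os.chdir("..") is skipped if the loop raises, for example on a KeyError. A small context manager can guarantee the return trip. The names working_directory and rename_inside are made up for this sketch; Python 3.11 and later also provide contextlib.chdir for the same purpose.

```python
import contextlib
import os

@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it even on errors."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)

def rename_inside(directory, mapping):
    """Same loop as Version 2, but the original directory is always restored."""
    with working_directory(directory):
        for name in os.listdir():
            old_id = name[:name.find("_")]
            if old_id in mapping:  # skip files that are not in the mapping
                os.rename(name, name.replace(old_id, mapping[old_id], 1))
```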

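Version 3

Same as my original post, but instead of modifying anything in the run segment, I modified output to exclude the file directory. This means I had to modify my rule all as well. It didn't work. The code is below: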

rule all:
    input:
        expand("{samples}_1.fastq",samples=SAMPLES),
        expand("{samples}_2.fastq",samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("{samples}_1.fastq",samples=SAMPLES),
        expand("{samples}_2.fastq",samples=SAMPLES)
    run:
        os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir():
             old_name= file[:file.find("_")]
             sample_name=SRA_MAPPING[old_name]
             new_name= file.replace(old_name,sample_name)
             os.rename(file,new_name)

The error it gives is:

MissingOutputException in line 24
...
Job files missing

The files do exist. So I don't know whether I made some mistake in the code or whether this is a bug.

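Conclusion

I wouldn't say this is a problem with snakemake. It is more a problem with my poorly thought-out process. In retrospect, it makes perfect sense that changing into a directory messes up snakemake's output/input handling. If I want to use the os module to change directories inside snakemake, I have to be very careful: go wherever I need to, but in the end return to where I started. Many thanks to /u/dariober and /u/SultanOrazbayev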

Answer 1

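I think snakemake gets confused by os.chdir. Your rule rename_srafiles_to_samples creates the correct files, and the input/output naming is fine. However, since you have changed directory, snakemake cannot find the expected output. I am not sure I am right about all of this, or, if I am, whether it is a bug... This version avoids os.chdir and seems to work: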

import csv
import os

SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}
SRAFILES = list(SRA_MAPPING.keys()) #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
    run:
        # os.chdir(os.getcwd()+r"/raw_samples")

        for file in os.listdir('raw_samples'):
             old_name= file[:file.find("_")]
             sample_name=SRA_MAPPING[old_name]
             new_name= file.replace(old_name,sample_name)
             os.rename('raw_samples/' + file, 'raw_samples/' + new_name)

(However, a more snakemake-ish solution would be to have a wildcard for the SRR id and have each rule executed once per SRR id, basically avoiding expand in download_srafiles and rename_srafiles_to_samples)
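A per-sample rule needs a way to get from a sample name back to its SRR id. The sketch below shows that lookup as a plain Python input function, with a possible rule wiring in comments. The names SAMPLE_TO_SRA, sra_fastqs, rename_srafile_to_sample and the renamed_samples/ directory are all hypothetical. The separate output directory is an assumption made so that the download and rename rules do not both match the same path pattern. The commented rule is untested.

```python
from types import SimpleNamespace

SRA_MAPPING = {"SRR1": "A", "SRR2": "B"}  # same toy mapping as in the answer
# Invert the mapping so a sample name can be traced back to its SRR id
SAMPLE_TO_SRA = {sample: sra for sra, sample in SRA_MAPPING.items()}

def sra_fastqs(wildcards):
    """Input function: map a {sample} wildcard to its two downloaded files."""
    sra = SAMPLE_TO_SRA[wildcards.sample]
    return [f"raw_samples/{sra}_1.fastq", f"raw_samples/{sra}_2.fastq"]

# In a Snakefile this might be wired up roughly as follows (assumption):
#
# rule rename_srafile_to_sample:
#     input:
#         sra_fastqs
#     output:
#         "renamed_samples/{sample}_1.fastq",
#         "renamed_samples/{sample}_2.fastq"
#     shell:
#         "mv {input[0]} {output[0]} && mv {input[1]} {output[1]}"

# Stand-in for the wildcards object snakemake would pass in
print(sra_fastqs(SimpleNamespace(sample="A")))
```

Because each job then handles exactly one sample, there is no need for os.listdir or os.chdir inside the rule at all.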
