Split multiple files into multiple parts with Snakemake

Question

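I am building a pipeline that takes a list of files as input (located anywhere on disk), splits each of them into smaller parts, runs some computation on every part, and then merges the results. I am struggling with the first step.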

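For example:
Input files = A and B
A and B are each split into 10 files: A1, A2, A3, A4 ... B9, B10.
Some computation is run on every sub-file: results_A1, results_A2 ... results_B10
The results are merged according to the input file they came from, so we end up with
results_A_merged and results_B_merged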

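The file-splitting tool (seqkit split) takes the number of chunks to split into, the file to split, and an output directory, and writes the split files into that directory following a fixed naming pattern. If the input file is path/to/file_A.fasta, it outputs output_dir/file_A.part_001.fasta, output_dir/file_A.part_002.fasta, and so on.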
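The naming pattern just described can be reproduced in plain Python, which is handy for building the list of files a split rule is expected to create. This is a minimal sketch, assuming the pattern is exactly as stated above; `expected_parts` is a hypothetical helper name, not part of seqkit or Snakemake.

```python
import os

def expected_parts(fasta_path, n_parts, out_dir):
    """Return the file names that the split step is described as producing.

    Assumption: parts are numbered from 001 and keep the input's extension.
    """
    # "path/to/file_A.fasta" -> prefix "file_A", ext ".fasta"
    prefix, ext = os.path.splitext(os.path.basename(fasta_path))
    return [f"{out_dir}/{prefix}.part_{i:03d}{ext}" for i in range(1, n_parts + 1)]

print(expected_parts("path/to/file_A.fasta", 2, "output_dir"))
# ['output_dir/file_A.part_001.fasta', 'output_dir/file_A.part_002.fasta']
```

Using `os.path.splitext` avoids hard-coding the extension, at the cost of treating only the last suffix as the extension (a name such as `x.fasta.gz` would need extra handling).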

I got this working with a single file as input.

import re

my_files=["path/to/file_1.fasta"]
my_files_dir=[]
my_files_prefix=[]
my_files_extension=[]

### Store the path to the dir, the file name without extension, and the extension.
for i in my_files:
    print(i)
    # One match per path: directory, prefix, and a .fna/.fa/.fasta extension
    m = re.search(r'(.*)/(.*)(\.(?:fna|fa|fasta))$', i)
    my_files_dir.append(m.group(1))
    my_files_prefix.append(m.group(2))
    my_files_extension.append(m.group(3))

### Create the names of all the split files (part_001 ... part_010, matching --by-part 10 below)
my_temp_fasta=[]
for i in range(1,11):
    my_temp_fasta.append(my_files_prefix[0]+'.part_%03d'%i+my_files_extension[0])

###Split my file.
rule split_fasta:
    input:
        my_files
    output:
        expand('splited_fasta/{tmp_fasta_files}', tmp_fasta_files=my_temp_fasta)
    params:
        num_sequences=10
    shell:
        "seqkit split --out-dir splited_fasta --by-part {params.num_sequences} {input}"

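But when I try to handle several files, I cannot even split them correctly.

Here is my non-working pipeline, which for now has a single rule that tries to split the files.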


import re

my_files=["path/to/file_1.fasta", "other/path/to/file_2.fasta"]

my_files_dir=[]
my_files_prefix=[]
my_files_extension=[]

### Store the path to the dir, the file name without extension, and the extension of each file.
for i in my_files:
    print(i)
    # One match per path: directory, prefix, and a .fna/.fa/.fasta extension
    m = re.search(r'(.*)/(.*)(\.(?:fna|fa|fasta))$', i)
    my_files_dir.append(m.group(1))
    my_files_prefix.append(m.group(2))
    my_files_extension.append(m.group(3))

### Store all the files that will be created by the split command (part_001 ... part_010).
my_temp_fasta_dict={}
for j in range(0,len(my_files)):
    tmp=[]
    for i in range(1,11):
        tmp.append(my_files_prefix[j]+'.part_%03d'%i+my_files_extension[j])
    my_temp_fasta_dict[my_files_prefix[j]] = tmp

## So I have a (useless...) dictionary, with the file name prefix as key and a list of split file names as values.

rule split_fasta:
    input:
        my_files
    output:
        expand('splited_fasta/{tmp_fasta_files}', tmp_fasta_files=my_temp_fasta_dict.values())
    params:
        num_sequences=10
    shell:
        "seqkit split --out-dir splited_fasta --by-part {params.num_sequences} {input}"

It produces a wrong command that concatenates all my input files:

seqkit split --out-dir splited_fasta --by-part 5 path/to/file_1.fasta other/path/to/file_2.fasta

instead of running the command twice, once on each input file. I just cannot get this to work. Worse, it is probably easy...

Thanks in advance for your help.

Answer 1

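Reasoning bottom-up (from the input files towards the targets) is a common mistake among new Snakemake users. Try a top-down approach instead (start from the target, then work out what is needed to build that target, and so on):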

# Zero-padded part numbers 001 ... 010, as in the seqkit split file names
PARTS = ["%03d" % n for n in range(1, 11)]

rule all:
    input: expand("results_{sample}_merged", sample=["A", "B"])

rule merge:
    input: expand("output_dir/file_{{sample}}.part_{n}.fasta", n=PARTS)
    output: "results_{sample}_merged"

rule split:
    input: "{sample}"
    output: expand("output_dir/file_{{sample}}.part_{n}.fasta", n=PARTS)
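The split rule above takes "{sample}" as its input, while the files in the question live in arbitrary directories. One possible way to bridge that gap, offered here as an assumption rather than as part of the original answer, is to map each sample name to its full path and let an input function do the lookup. The names `MY_FILES`, `SAMPLES` and `source_fasta` are hypothetical; the paths are the ones from the question.

```python
import os
from types import SimpleNamespace

# Input list taken from the question; files may live anywhere on disk.
MY_FILES = ["path/to/file_1.fasta", "other/path/to/file_2.fasta"]

# Map sample name (file name without extension) -> full path.
# Assumption: sample names are unique across directories.
SAMPLES = {os.path.splitext(os.path.basename(p))[0]: p for p in MY_FILES}

def source_fasta(wildcards):
    """Input function: return the original path for the requested sample."""
    return SAMPLES[wildcards.sample]

# Stand-in for the wildcards object Snakemake would pass to the function.
print(source_fasta(SimpleNamespace(sample="file_2")))
# other/path/to/file_2.fasta
```

In a Snakefile, `source_fasta` could then be given as the `input:` of the split rule, and `list(SAMPLES)` as the sample list in `rule all`, so that the split command runs once per sample instead of once on all files together.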
