我的合并重叠坐标 python 脚本占用太多内存

My merge overlapping coordinate python script cost too much memory

我正在尝试为我的无创产前检测 (NIPT) 项目选择独特的区域。我已完成以下步骤:

我有一个大约 40gb 的 .sam 文件,下一步我尝试将所有重叠坐标合并到一个 .bed 文件中,以便在 samtools 中使用。这是我的 python 脚本:

import glob
import time

folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
    now = time.time()
    print(f'Analyzing {file}')
    raw = []
    treat1 = []
    treat2 = []
    with open(file) as samfile:
        print('Importing coordinates')
        for line in samfile:
            coord_raw = line.split('\t')[0]
            coord_chr = coord_raw.split('_')[0]
            coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
            raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
        print('Begin merging overlapping area...')
        last_coord = () #0:chr, 1:begin, 2:end
        for chrom ,begin , end in raw:
            if last_coord == () or last_coord[0] != chrom:
                last_coord = (chrom,begin,end)
            else:
                if last_coord[0] == chrom:
                    if begin > last_coord[1] and end < last_coord[2]:
                        last_coord = (chrom,last_coord[1],end)
                    else:
                        treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                        last_coord = (chrom,begin,end)
                else:
                    treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                    last_coord = (chrom,begin,end)       
        print('Begin merging nearby area...')                         
        last_coord = ()
        for chrom ,begin , end in treat1:
            if last_coord == ():
                last_coord = (chrom,begin,end) 
            else:
                if chrom == last_coord[0]:
                    if begin == last_coord[2] + 1:
                        last_coord = (last_coord[0],last_coord[1],end)
                    else:
                        treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
                        last_coord = (chrom,begin,end)
                else:
                    treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
                    last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
    bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')

然而,30 分钟后运行,我发现程序耗尽了我的所有内存并崩溃了。有什么建议可以减少此脚本消耗的内存量吗?非常感谢

编辑:看来我需要将 samfile 拆分到每个染色体并分别完成,然后将所有 bed 文件合并在一起。感谢@ramzeek 提供减少内存使用量的建议

import glob
import time

folder = glob.glob('analysys/*.sam')

now = time.time() 

for file in folder:
    with open(f"{file.split('/')[-1].replace('.sam','.bed')}",'w') as bedfile:
        print(f'Analyzing {file}')
        raw = []
        treat1 = []
        with open(file) as samfile:
            print('Importing coordinates')
            last_coord = ()
            for line in samfile:
                chrom = line.split('\t')[0]
                begin = chrom.split('_')[1].split('-')[0]
                end = chrom.split('_')[1].split('-')[1]
                if last_coord == () or last_coord[0] != chrom:
                    last_coord = (chrom,begin,end)
                else:
                    if last_coord[0] == chrom:
                        if begin > last_coord[1] and end < last_coord[2]:
                            last_coord = (chrom,last_coord[1],end)
                        else:
                            treat1.append(last_coord)
                            last_coord = (chrom,begin,end)
                    else:
                        treat1.append(last_coord)
                        last_coord = (chrom,begin,end)       
            print('Begin merging nearby area...')                         
            last_coord = ()
            for chrom ,begin , end in treat1:
                if last_coord == ():
                    last_coord = (chrom,begin,end) 
                else:
                    if chrom == last_coord[0]:
                        if begin == last_coord[2] + 1:
                            last_coord = (last_coord[0],last_coord[1],end)
                        else:
                            bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                            last_coord = (chrom,begin,end)
                    else:
                        bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                        last_coord = (chrom,begin,end)

print(f'Program completed! Total time: {int(time.time() - now)} seconds')

希望这能解决问题。很难知道,因为我在没有输入文件或不知道输出应该是什么样子的情况下工作,并且不能真正 运行 检查错误的代码,但这是我尝试做的事情:

  1. 消除raw,直接创建treat1;和
  2. 消除treat2,直接写入bedfile。我立即打开了 bedfile,您程序中的所有内容都是在打开该文件的情况下完成的。

如果这满足了您的需要但仍然崩溃,那么也许您可以通读每个文件两次以获得 treat1 末尾的 last_coord 然后再次通读以“重新创建" treat1 的每一行单独应用它来定义需要写入文件的内容。

不知道你在做什么的细节(我不在靠近应用 samtools 的领域的任何地方工作)。

import glob
import time

folder = glob.glob('analysys/*.sam')

# These two lines are not used
#core = 22
#nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']

# moved this to outside the for loop since the time check was 
# after the for loop
now = time.time() 

with open('unique_coord.bed','w') as bedfile:
    for file in folder:
        print(f'Analyzing {file}')
        raw = []
        treat1 = []
        with open(file) as samfile:
            print('Importing coordinates')
            last_coord = ()
            for line in samfile:
                chrom = line.split('\t')[0]
                begin = coord_raw.split('_')[0]
                end = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
                if last_coord == () or last_coord[0] != chrom:
                    last_coord = (chrom,begin,end)
                else:
                    if last_coord[0] == chrom:
                        if begin > last_coord[1] and end < last_coord[2]:
                            last_coord = (chrom,last_coord[1],end)
                        else:
                            treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                            last_coord = (chrom,begin,end)
                    else:
                        treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                        last_coord = (chrom,begin,end)       
            print('Begin merging nearby area...')                         
            last_coord = ()
            for chrom ,begin , end in treat1:
                if last_coord == ():
                    last_coord = (chrom,begin,end) 
                else:
                    if chrom == last_coord[0]:
                        if begin == last_coord[2] + 1:
                            last_coord = (last_coord[0],last_coord[1],end)
                        else:
                            bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                            last_coord = (chrom,begin,end)
                    else:
                        bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                        last_coord = (chrom,begin,end)

print(f'Program completed! Total time: {int(time.time() - now)} seconds')