我的合并重叠坐标 python 脚本占用太多内存
My merge overlapping coordinate python script cost too much memory
我正在尝试为我的无创产前检测 (NIPT) 项目选择独特的区域。我已完成以下步骤:
- 创建一个包含50bp序列的人工fasta文件。在每条染色体上,下一个序列与前一个序列重叠 40bp
- 对齐且只选择无错配序列
我有一个大约 40gb 的 .sam 文件,下一步我尝试将所有重叠坐标合并到一个 .bed 文件中,以便在 samtools 中使用。这是我的 python 脚本:
import glob
import time
folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
now = time.time()
print(f'Analyzing {file}')
raw = []
treat1 = []
treat2 = []
with open(file) as samfile:
print('Importing coordinates')
for line in samfile:
coord_raw = line.split('\t')[0]
coord_chr = coord_raw.split('_')[0]
coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
print('Begin merging overlapping area...')
last_coord = () #0:chr, 1:begin, 2:end
for chrom ,begin , end in raw:
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
然而,30 分钟后运行,我发现程序耗尽了我的所有内存并崩溃了。有什么建议可以减少此脚本消耗的内存量吗?非常感谢
编辑:看来我需要将 samfile 拆分到每个染色体并分别完成,然后将所有 bed 文件合并在一起。感谢@ramzeek 提供减少内存使用量的建议
import glob
import time
folder = glob.glob('analysys/*.sam')
now = time.time()
for file in folder:
with open(f"{file.split('/')[-1].replace('.sam','.bed')}",'w') as bedfile:
print(f'Analyzing {file}')
raw = []
treat1 = []
with open(file) as samfile:
print('Importing coordinates')
last_coord = ()
for line in samfile:
chrom = line.split('\t')[0]
begin = chrom.split('_')[1].split('-')[0]
end = chrom.split('_')[1].split('-')[1]
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(last_coord)
last_coord = (chrom,begin,end)
else:
treat1.append(last_coord)
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
希望这能解决问题。很难知道,因为我在没有输入文件或不知道输出应该是什么样子的情况下工作,并且不能真正 运行 检查错误的代码,但这是我尝试做的事情:
- 消除
raw
,直接创建treat1
;和
- 消除
treat2
,直接写入bedfile
。我立即打开了 bedfile
,您程序中的所有内容都是在打开该文件的情况下完成的。
如果这满足了您的需要但仍然崩溃,那么也许您可以通读每个文件两次以获得 treat1
末尾的 last_coord
然后再次通读以“重新创建" treat1
的每一行单独应用它来定义需要写入文件的内容。
不知道你在做什么的细节(我不在靠近应用 samtools 的领域的任何地方工作)。
import glob
import time
folder = glob.glob('analysys/*.sam')
# These two lines are not used
#core = 22
#nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
# moved this to outside the for loop since the time check was
# after the for loop
now = time.time()
with open('unique_coord.bed','w') as bedfile:
for file in folder:
print(f'Analyzing {file}')
raw = []
treat1 = []
with open(file) as samfile:
print('Importing coordinates')
last_coord = ()
for line in samfile:
chrom = line.split('\t')[0]
begin = coord_raw.split('_')[0]
end = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
我正在尝试为我的无创产前检测 (NIPT) 项目选择独特的区域。我已完成以下步骤:
- 创建一个包含50bp序列的人工fasta文件。在每条染色体上,下一个序列与前一个序列重叠 40bp
- 对齐且只选择无错配序列
我有一个大约 40gb 的 .sam 文件,下一步我尝试将所有重叠坐标合并到一个 .bed 文件中,以便在 samtools 中使用。这是我的 python 脚本:
import glob
import time
folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
now = time.time()
print(f'Analyzing {file}')
raw = []
treat1 = []
treat2 = []
with open(file) as samfile:
print('Importing coordinates')
for line in samfile:
coord_raw = line.split('\t')[0]
coord_chr = coord_raw.split('_')[0]
coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
print('Begin merging overlapping area...')
last_coord = () #0:chr, 1:begin, 2:end
for chrom ,begin , end in raw:
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
然而,30 分钟后运行,我发现程序耗尽了我的所有内存并崩溃了。有什么建议可以减少此脚本消耗的内存量吗?非常感谢
编辑:看来我需要将 samfile 拆分到每个染色体并分别完成,然后将所有 bed 文件合并在一起。感谢@ramzeek 提供减少内存使用量的建议
import glob
import time
folder = glob.glob('analysys/*.sam')
now = time.time()
for file in folder:
with open(f"{file.split('/')[-1].replace('.sam','.bed')}",'w') as bedfile:
print(f'Analyzing {file}')
raw = []
treat1 = []
with open(file) as samfile:
print('Importing coordinates')
last_coord = ()
for line in samfile:
chrom = line.split('\t')[0]
begin = chrom.split('_')[1].split('-')[0]
end = chrom.split('_')[1].split('-')[1]
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(last_coord)
last_coord = (chrom,begin,end)
else:
treat1.append(last_coord)
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
希望这能解决问题。很难知道,因为我在没有输入文件或不知道输出应该是什么样子的情况下工作,并且不能真正 运行 检查错误的代码,但这是我尝试做的事情:
- 消除
raw
,直接创建treat1
;和 - 消除
treat2
,直接写入bedfile
。我立即打开了bedfile
,您程序中的所有内容都是在打开该文件的情况下完成的。
如果这满足了您的需要但仍然崩溃,那么也许您可以通读每个文件两次以获得 treat1
末尾的 last_coord
然后再次通读以“重新创建" treat1
的每一行单独应用它来定义需要写入文件的内容。
不知道你在做什么的细节(我不在靠近应用 samtools 的领域的任何地方工作)。
import glob
import time
folder = glob.glob('analysys/*.sam')
# These two lines are not used
#core = 22
#nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
# moved this to outside the for loop since the time check was
# after the for loop
now = time.time()
with open('unique_coord.bed','w') as bedfile:
for file in folder:
print(f'Analyzing {file}')
raw = []
treat1 = []
with open(file) as samfile:
print('Importing coordinates')
last_coord = ()
for line in samfile:
chrom = line.split('\t')[0]
begin = coord_raw.split('_')[0]
end = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
print(f'Program completed! Total time: {int(time.time() - now)} seconds')