如何优化具有 7200 万个条目的文件的代码?
How to optimize code for a file with 72 million entries?
我在另一个平台上问了一个问题 (here) - 如果能得到你的意见,我的 Python 代码将在很短的时间内 运行 .目前,一个百万条目的文件已经耗时3个多小时。
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts1 = {}
dicts2 = {}
lst = []
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
#print(record.id)
#print(record.seq)
#looking for the 3 prime adapter
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
#Only record is used to be able to save the all atributes like phred score in the fastq file
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts1[i]=x
#write ids and seq in a dictionary and keep one if there are multiple seqs with miRNA-seq + UMI
for k,v in dicts1.items():
if v not in dicts2.values():
dicts2[k] = v
#making a list
for keys in dicts2:
lst.append(keys)
#print(dicts1)
#print(dicts2)
#print(lst)
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
#based on the saved ids in the list print the entries (miRNA + 3' adapter + UMI)
if record.id in lst:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
#print(record.seq)
#print(miRNAseq.seq)
#print(adapter_seq.seq)
#print(umi_seq.seq)
#print("@"+record.id)
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
QIAseq_UMI_correction()
为什么不对文件进行一次读取、解析和循环?我已将第二个循环的代码移至第一个循环,我是否遗漏了什么?为什么要循环两次?
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts1 = {}
dicts2 = {}
lst = []
sentinel = 100
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
# only for testing
if sentinel < 0:
break
sentinel -= 1
#print(record.id)
#print(record.seq)
#looking for the 3 prime adapter
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
#Only record is used to be able to save the all atributes like phred score in the fastq file
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
if x not in dicts2:
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
dicts1[i]=x
dicts2[x]=i
if __name__ == '__main__':
QIAseq_UMI_correction()
其他建议:
如评论中所述,您可以对主要步骤进行计时,以查看可以缩短时间的地方。例如,查看 timeit。我的建议是 SeqIO.parse
、if "AACTGTAGGCACCATCAAT" in record.seq:
我怀疑大部分时间都花在了 SeqIO.parse 的解析上,所以你最好使用一次。
最后一个建议是使用较小的记录集,直到您的代码准备好满足您的需要。我添加了一个哨兵变量作为示例,以在探索了 100 条匹配记录时跳出循环。
我第一眼看到的是这些:
首先检查
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
您可以使用以下方法避免搜索序列两次。
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
if adapter_pos != -1: # check that sequence is found
行数:
i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts1[i]=x
可改为:
dicts1[record.id]=miRNAseq.seq+umi_seq.seq
下一行:
for k,v in dicts1.items():
if v not in dicts2.values():
dicts2[k] = v
可以改成
dict2 = {**dict1,**dict2}
然而,我不确定这种变化实际上可能会以性能成本为代价。
接下来是
for keys in dicts2:
lst.append(keys)
可以删除并且
lst = list(dicts2.keys())
可以在第一次通过之外和之后添加(在那里你有注释掉的打印语句。)
最后,@Roeften 建议您可以将第二段代码与第一段代码放在一起,以避免两次遍历整个文件。
至少其中一些建议会有所帮助,但实际上 python 并不是一种语言那么快,如果您想定期进行此类分析,您可能会考虑使用更快的语言。
我通过删除第二部分来解决它 - 然后检查 dicts.keys()
中是否不存在 x(搜索 dicts.values()
慢得多,这是真正的瓶颈)然后写 y
在输出中。我现在只使用一个字典,代码已经在生成输出。这是更新后的版本。
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts = {}
lst = []
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
i = record.id
x = miRNAseq.seq+umi_seq.seq
y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
if x not in dicts.keys():
dicts[x]=i
if len(y) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
if len(y) > 50:
cut = len(y) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
QIAseq_UMI_correction()
我会在这里使用 partition-method 来回答你自己的问题。
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts = {}
lst = []
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
miRNAseq, adapter_seq, umi_seq = str(record.seq).partition("AACTGTAGGCACCATCAAT")
if adapter_seq:
x = miRNAseq.seq+umi_seq.seq
y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts[miRNAseq + umi_seq] = record.id
if len(y) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
else:
cut = len(y) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
QIAseq_UMI_correction()
我在另一个平台上问了一个问题 (here) - 如果能得到你的意见,我的 Python 代码将在很短的时间内 运行 .目前,一个百万条目的文件已经耗时3个多小时。
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts1 = {}
dicts2 = {}
lst = []
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
#print(record.id)
#print(record.seq)
#looking for the 3 prime adapter
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
#Only record is used to be able to save the all atributes like phred score in the fastq file
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts1[i]=x
#write ids and seq in a dictionary and keep one if there are multiple seqs with miRNA-seq + UMI
for k,v in dicts1.items():
if v not in dicts2.values():
dicts2[k] = v
#making a list
for keys in dicts2:
lst.append(keys)
#print(dicts1)
#print(dicts2)
#print(lst)
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
#based on the saved ids in the list print the entries (miRNA + 3' adapter + UMI)
if record.id in lst:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
#print(record.seq)
#print(miRNAseq.seq)
#print(adapter_seq.seq)
#print(umi_seq.seq)
#print("@"+record.id)
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
QIAseq_UMI_correction()
为什么不对文件进行一次读取、解析和循环?我已将第二个循环的代码移至第一个循环,我是否遗漏了什么?为什么要循环两次?
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts1 = {}
dicts2 = {}
lst = []
sentinel = 100
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
# only for testing
if sentinel < 0:
break
sentinel -= 1
#print(record.id)
#print(record.seq)
#looking for the 3 prime adapter
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
#Only record is used to be able to save the all atributes like phred score in the fastq file
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
if x not in dicts2:
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
dicts1[i]=x
dicts2[x]=i
if __name__ == '__main__':
QIAseq_UMI_correction()
其他建议:
如评论中所述,您可以对主要步骤进行计时,以查看可以缩短时间的地方。例如,查看 timeit。我的建议是 SeqIO.parse
、if "AACTGTAGGCACCATCAAT" in record.seq:
我怀疑大部分时间都花在了 SeqIO.parse 的解析上,所以你最好使用一次。
最后一个建议是使用较小的记录集,直到您的代码准备好满足您的需要。我添加了一个哨兵变量作为示例,以在探索了 100 条匹配记录时跳出循环。
我第一眼看到的是这些:
首先检查
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
您可以使用以下方法避免搜索序列两次。
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
if adapter_pos != -1: # check that sequence is found
行数:
i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts1[i]=x
可改为:
dicts1[record.id]=miRNAseq.seq+umi_seq.seq
下一行:
for k,v in dicts1.items():
if v not in dicts2.values():
dicts2[k] = v
可以改成
dict2 = {**dict1,**dict2}
然而,我不确定这种变化实际上可能会以性能成本为代价。
接下来是
for keys in dicts2:
lst.append(keys)
可以删除并且
lst = list(dicts2.keys())
可以在第一次通过之外和之后添加(在那里你有注释掉的打印语句。)
最后,@Roeften 建议您可以将第二段代码与第一段代码放在一起,以避免两次遍历整个文件。
至少其中一些建议会有所帮助,但实际上 python 并不是一种语言那么快,如果您想定期进行此类分析,您可能会考虑使用更快的语言。
我通过删除第二部分来解决它 - 然后检查 dicts.keys()
中是否不存在 x(搜索 dicts.values()
慢得多,这是真正的瓶颈)然后写 y
在输出中。我现在只使用一个字典,代码已经在生成输出。这是更新后的版本。
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts = {}
lst = []
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
if "AACTGTAGGCACCATCAAT" in record.seq:
adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
miRNAseq = record[:adapter_pos]
adapter_seq=record[adapter_pos:adapter_pos+19]
umi_seq = record[adapter_pos+19:adapter_pos+19+12]
i = record.id
x = miRNAseq.seq+umi_seq.seq
y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
if x not in dicts.keys():
dicts[x]=i
if len(y) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
if len(y) > 50:
cut = len(y) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
QIAseq_UMI_correction()
我会在这里使用 partition-method 来回答你自己的问题。
from Bio import SeqIO
import sys
def QIAseq_UMI_correction():
script=sys.argv[0]
file_name=sys.argv[1]
dicts = {}
lst = []
with open(file_name, "r") as Fastq:
for record in SeqIO.parse(Fastq,'fastq'):
miRNAseq, adapter_seq, umi_seq = str(record.seq).partition("AACTGTAGGCACCATCAAT")
if adapter_seq:
x = miRNAseq.seq+umi_seq.seq
y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts[miRNAseq + umi_seq] = record.id
if len(y) <= 50:
print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
else:
cut = len(y) - 50
print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
QIAseq_UMI_correction()