Optimal substitute data structure to improve running time due to huge size of dictionary?
I have a Python script in which I initialize a dictionary with about 4.9 million keys. Each key maps to a 24-element list that I initialize to zeros. I need to parse a text file of about 9.7 million lines (20 columns each) and, whenever a line matches a particular dictionary key, increment the appropriate integer in that key's list.
The problem is that parsing is extremely slow and my job keeps getting killed (the cluster has a 24-hour walltime limit). The dictionary to be initialized is about 200 MB in size, and after some timing checks I found that parsing 10,000 lines takes roughly 16 minutes, so parsing all 9.7 million lines would take on the order of 242 hours.
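(For scale, a rough, hypothetical back-of-the-envelope extrapolation from that timing check; the per-10,000-line figure is approximate, so this is only an estimate:)

# Rough extrapolation of total parse time from the timing check above
total_lines = 9.7e6          # lines in the text file to parse
sample_lines = 10000         # lines used for the timing check
minutes_per_sample = 16.0    # approximate measured time for the sample

total_hours = total_lines / sample_lines * minutes_per_sample / 60
print 'Estimated total parse time: %.0f hours' % total_hours
# => roughly 250 hours, far beyond the 24-hour walltime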
In short, I just need to count and increment the appropriate values for dictionary keys. Is there a substitute data structure for the Python dictionary that would optimize this script and let it run in a reasonable amount of time?
def count_dict_init(file):
    gff_file = open(file, 'r')
    pos_list = []
    for line in gff_file:
        line_list = line.strip().split('\t')
        if line.startswith('chr') and line[0:5] != 'chrmt':
            if line_list[2] == 'CDS':
                leftpos = int(line_list[3])
                rightpos = int(line_list[4])
                for position in range(leftpos - 100, rightpos + 101):
                    pos_list.append(position)
    uniq_list = set(pos_list)
    sorted_list = list(uniq_list)
    sorted_list.sort()
    pos_dict = {}
    for pos in sorted_list:
        pos_dict[pos] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '', '']
    print 'Size of count dictionary is ', sys.getsizeof(pos_dict)
    return pos_dict
def sam_parser(sam_file, count):
    dict_count = count
    parsed_file = open('Sam_parsed_dict.tab', 'w')
    non_cds_file = open('Non_Cds_file', 'w')
    for line in sam_file:
        if line[0] != '@':
            fields = line.split('\t')
            if len(fields) > 19:
                multi_flag = fields[19].strip()
                # If the read has more than one alignment then report it as multiple mapping
                if multi_flag != 'NH:i:1':
                    multi_align = 'Y'
                else:
                    multi_align = 'N'
            else:
                multi_align = 'N'
            non_cds = False
            sam_flag = int(fields[1])
            chr_num = fields[2]
            read_length = len(fields[9])
            pos_in_value = (read_length - 27) * 2  # Determines which list position to update
            if 27 <= read_length <= 37:
                if sam_flag == 0:  # Primary alignment on forward strand
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value] += 1
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 16:  # On reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value + 1] += 1
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 256:  # Not primary alignment
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 272:  # Not primary alignment and on reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
            if non_cds:
                non_cds_file.write(str(chr_num)+'\t'+str(fields[3])+'\n')
    for pos, counts in dict_count.iteritems():
        parsed_file.write(str(pos)+'\t'+'\t'.join(map(str, counts))+'\n')
    parsed_file.close()
    non_cds_file.close()

if __name__ == "__main__":
    # Parse arguments from commandline
    arguments = parse_arguments()
    GFF = arguments.gfffile
    chrnum = arguments.chrnum
    initial_count_dict = count_dict_init(GFF)
    SAM = open(arguments.inputPath)
    sam_parser(SAM, initial_count_dict)
I think your problem is this expression: if five_prime in dict_count.keys():
That creates a new list of every key in your dictionary (4.9M of them) and then walks through it linearly until it finds the key, or walks the entire list if the key isn't there.
Since looking a key up in the dictionary itself takes one operation, while finding it in that list takes up to 4.9M operations, you want to use this instead: if five_prime in dict_count:
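To make the difference concrete, here is a small, hypothetical micro-benchmark sketch. It assumes Python 2 (which the question's print statements and iteritems suggest), where dict.keys() materializes a full list on every call:

import timeit

# Hypothetical micro-benchmark: membership test against the dict itself
# versus against the list that dict.keys() builds on every call (Python 2).
setup = "d = dict.fromkeys(xrange(1000000), 0)"

fast = timeit.timeit("999999 in d", setup=setup, number=10)
slow = timeit.timeit("999999 in d.keys()", setup=setup, number=10)

print 'in d        :', fast  # hash lookup, effectively constant time
print 'in d.keys() :', slow  # builds a 1M-element list, then scans it linearly

With 4.9 million keys and 9.7 million input lines, rebuilding and scanning that list on every line is what pushes the run into hundreds of hours.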
The other thing is that you are doing more lookups than you need to. If the dictionary lookup is any kind of bottleneck, you can minimize it by doing the lookup only once per iteration. Here is some sample code:
five_prime = int(fields[3])
record = dict_count.get(five_prime)
if record is not None:
    record[pos_in_value] += 1
    aligner_cis = record[22]
    if aligner_cis == 'Y':
        continue
    else:
        record[22] = multi_align
else:
    non_cds = True
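The same single-lookup pattern would presumably carry over to the other branches as well; as a sketch, the reverse-strand case (sam_flag == 16) could be rewritten the same way, reusing the variable names from the loop above:

if sam_flag == 16:  # On reverse strand
    five_prime = int(fields[3]) + read_length - 1
    record = dict_count.get(five_prime)  # one lookup instead of three
    if record is not None:
        record[pos_in_value + 1] += 1
        aligner_trans = record[23]
        if aligner_trans == 'Y':
            continue
        else:
            record[23] = multi_align
    else:
        non_cds = True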