Optimal substitute data structure to improve running time due to huge size of dictionary?

I have a python script in which I initialize a dictionary containing about 4.9 million keys. Each key maps to a list of 24 elements that I initialize to zero. I need to parse a text file of about 9.7 million lines (20 columns per line) and, based on a specific match to a dictionary key, increment the appropriate integer in that key's list.

The problem is that the parsing is extremely slow and my job keeps getting killed (24-hour walltime limit on the cluster). The dictionary to be initialized is about 200 MB, and from some timing checks I found that parsing 10,000 lines takes roughly 16 minutes, so parsing all 9.7 million lines would take around 242 hours.

In short, all I need to do is count and increment the appropriate values for the dictionary keys. Is there an alternative data structure to a python dictionary that can optimize this script and make it run in a reasonable amount of time?

import sys

def count_dict_init(file):
    # Build a dict keyed by every position within 100 bp of a CDS feature,
    # each mapping to 22 integer counters plus two flag fields.
    gff_file = open(file, 'r')
    pos_list = []
    for line in gff_file:
        line_list = line.strip().split('\t')
        if line.startswith('chr') and line[0:5] != 'chrmt':
            if line_list[2] == 'CDS':
                leftpos = int(line_list[3])
                rightpos = int(line_list[4])
                for position in range(leftpos - 100, rightpos + 101):
                    pos_list.append(position)

    uniq_list = set(pos_list)
    sorted_list = list(uniq_list)
    sorted_list.sort()
    pos_dict = {}
    for pos in sorted_list:
        pos_dict[pos] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '', '']

    print 'Size of count dictionary is ', sys.getsizeof(pos_dict)
    return pos_dict

def sam_parser(sam_file, count):
    # Walk the SAM file and increment the per-position counters in dict_count.
    dict_count = count
    parsed_file = open('Sam_parsed_dict.tab', 'w')
    non_cds_file = open('Non_Cds_file', 'w')
    for line in sam_file:
        if line[0] != '@':
            fields = line.split('\t')
            if len(fields) > 19:
                multi_flag = fields[19].strip()
                # If the read has more than one alignment then report it as multiple mapping
                if multi_flag != 'NH:i:1':
                    multi_align = 'Y'
                else:
                    multi_align = 'N'
            else:
                multi_align = 'N'
            non_cds = False
            sam_flag = int(fields[1])
            chr_num = fields[2]
            read_length = len(fields[9])
            pos_in_value = (read_length - 27) * 2  # Determines which list position to update
            if 27 <= read_length <= 37:
                if sam_flag == 0:  # Primary alignment on forward strand
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value] += 1
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 16:  # On reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        dict_count[five_prime][pos_in_value + 1] += 1
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 256:  # Not primary alignment
                    five_prime = int(fields[3])
                    if five_prime in dict_count.keys():
                        aligner_cis = dict_count[five_prime][22]
                        if aligner_cis == 'Y':
                            continue
                        else:
                            dict_count[five_prime][22] = multi_align
                    else:
                        non_cds = True
                if sam_flag == 272:  # Not primary alignment and on reverse strand
                    five_prime = int(fields[3]) + read_length - 1
                    if five_prime in dict_count.keys():
                        aligner_trans = dict_count[five_prime][23]
                        if aligner_trans == 'Y':
                            continue
                        else:
                            dict_count[five_prime][23] = multi_align
                    else:
                        non_cds = True
                if non_cds:
                    non_cds_file.write(str(chr_num)+'\t'+str(fields[3])+'\n')

    for pos, counts in dict_count.iteritems():
        parsed_file.write(str(pos)+'\t'+'\t'.join(map(str, counts))+'\n')

    parsed_file.close()
    non_cds_file.close()

if __name__ == "__main__":
    # Parse arguments from commandline
    arguments = parse_arguments()
    GFF = arguments.gfffile
    chrnum = arguments.chrnum
    initial_count_dict = count_dict_init(GFF)
    SAM = open(arguments.inputPath)
    sam_parser(SAM, initial_count_dict)

I think your problem is this expression: if five_prime in dict_count.keys():

That creates a new list of every key in your dictionary (4.9M of them) and then walks through it linearly until it finds the key (or walks through the entire list if it doesn't find it).

Since looking a key up in the dictionary takes 1 operation while finding it in that list takes up to 4.9M operations, you want to use this instead: if five_prime in dict_count:
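
As a rough illustration, here is a minimal timing sketch, assuming the script runs on Python 2 (where dict.keys() materializes a list; on Python 3, keys() is a set-like view and the gap largely disappears). The key count mirrors the 4.9M figure from the question:

import timeit

d = dict.fromkeys(xrange(4900000), 0)

# Membership test against the dict itself: a single hash lookup.
print timeit.timeit('4899999 in d', 'from __main__ import d', number=10)

# Membership test against d.keys(): builds a 4.9M-element list and scans it.
print timeit.timeit('4899999 in d.keys()', 'from __main__ import d', number=10)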

The other thing is that you are doing a few more lookups than you need to. If the dictionary lookup is in any way a bottleneck, you can minimize it by looking the key up only once per iteration. Here is some example code:

five_prime = int(fields[3])
record = dict_count.get(five_prime)
if record is not None:
    record[pos_in_value] += 1
    aligner_cis = record[22]
    if aligner_cis == 'Y':
        continue
    else:
        record[22] = multi_align
else:
    non_cds = True
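
The same single-lookup pattern applies to the other sam_flag branches as well; for example, the reverse-strand case (sam_flag == 16) from the original function could be rewritten along these lines (a sketch only, preserving the original control flow):

if sam_flag == 16:  # On reverse strand
    five_prime = int(fields[3]) + read_length - 1
    record = dict_count.get(five_prime)  # one dictionary lookup instead of several
    if record is not None:
        record[pos_in_value + 1] += 1
        aligner_trans = record[23]
        if aligner_trans == 'Y':
            continue
        else:
            record[23] = multi_align
    else:
        non_cds = True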