Improving a nested loop for efficiency

Question

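I'm working on a project that analyzes PSL files. Overall, the program looks at read pairs and identifies circular molecules. The program works, but because my operations are nested it is very inefficient: reading through the whole PSL file takes more than 10 minutes instead of the ~15 seconds it should.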

The relevant code is:

def readPSLpairs(self):

    posread = []
    negread = []
    result = {}
    for psl in self.readPSL():
        parsed = psl.split()
        strand = parsed[9][-1]
        if strand == '1':
            posread.append(parsed)
        elif strand == '2':
            negread.append(parsed)

    for read in posread:
        posname = read[9][:-2]
        poscontig = read[13]
        for read in negread:
            negname = read[9][:-2]
            negcontig = read[13]
            if posname == negname and poscontig == negcontig:
                try:
                    result[poscontig] += 1
                    break
                except:
                    result[poscontig] = 1
                    break
    print(result)

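I tried restructuring the whole operation so that I would not have to append the values to lists and then match posname == negname and poscontig == negcontig, but it turned out to be much harder than I expected, so I'm stuck trying to improve this function.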

Answer 1

import collections

all_dict = {"pos": collections.defaultdict(int),
            "neg": collections.defaultdict(int)}

result = {}

for psl in self.readPSL():
    parsed = psl.split()
    strand = "pos" if parsed[9][-1]=='1' else "neg"
    name, contig = parsed[9][:-2], parsed[13]
    all_dict[strand][(name,contig)] += 1
# pre-process all the psl's into all_dict['pos'] or all_dict['neg']
#   this is basically just a `collections.Counter` of what you're doing already!

for info, posqty in all_dict['pos'].items():
    negqty = all_dict['neg'][info]  # (defaults to zero)
    result[info] = posqty * negqty
# process all the 'pos' psl's. For every match with a 'neg', set
#   result[(name, contig)] to the total (posqty * negqty)

Note that this discards the full parsed psl values and keeps only the name and contig slices.
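
If the goal is to keep the per-contig counts that the question's nested loop produces (one count for each forward read that has at least one mate on the same contig), a set lookup can stand in for the inner loop. The following is only a sketch under assumptions: the function name `count_pairs_per_contig` and the helper `fake_psl_line` are made up for illustration, and it takes an iterable of PSL lines rather than calling `self.readPSL()`. It assumes, as the question's code does, that the read name is in column index 9 with a two-character mate suffix ending in `1` or `2`, and that the contig is in column index 13.

```python
from collections import Counter


def count_pairs_per_contig(psl_lines):
    """Count, per contig, the mate-1 reads that have a mate-2 read on the same contig."""
    pos_pairs = []     # (name, contig) for every mate-1 read, duplicates kept
    neg_pairs = set()  # distinct (name, contig) seen among mate-2 reads
    for line in psl_lines:
        parsed = line.split()
        query, contig = parsed[9], parsed[13]
        name, strand = query[:-2], query[-1]
        if strand == '1':
            pos_pairs.append((name, contig))
        elif strand == '2':
            neg_pairs.add((name, contig))
    # Set membership is O(1) on average, so this single pass replaces the inner loop.
    return Counter(contig for name, contig in pos_pairs
                   if (name, contig) in neg_pairs)


def fake_psl_line(query, contig):
    """Hypothetical helper: build a whitespace-separated line with only the two columns used here filled in."""
    return " ".join(["0"] * 9 + [query] + ["0"] * 3 + [contig])


sample = [
    fake_psl_line("r1/1", "c1"), fake_psl_line("r1/2", "c1"),  # pair on c1
    fake_psl_line("r2/1", "c1"), fake_psl_line("r2/2", "c2"),  # mates on different contigs
    fake_psl_line("r3/1", "c2"), fake_psl_line("r3/2", "c2"),  # pair on c2
]
result = count_pairs_per_contig(sample)
print(dict(result))
```

Unlike the multiplication in the answer above, this counts each forward read at most once, which matches the `break` in the original loop. Total work grows with the number of lines rather than with the product of the two list lengths.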

python

bioinformatics

python-3.x