Improving runtime when finding nonzero streaks in a list

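I wrote a function that takes a list of integers and a value l. The return value is a tuple: the first element is the count of streaks of nonzero integers (each streak must have length l or greater), and the second element is the average length of those streaks. I'm using this function for a bioinformatics project that requires me to pass in lists containing millions of integers, and I've found that the method I wrote is too slow. How can I improve the efficiency of my program?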

def contigs_values(sequenced_lst, l):
    """
    By splitting at zeros, can count contiguous sequenced sequences
    """
    # a list of lists storing index start and end values of each contig
    contig_indices = []

    start = 0
    end = 1
    while end != len(sequenced_lst):
        if 0 not in sequenced_lst[start:end + 1]:  # only extend window if contiguous
            end += 1  # extend contig window
            # continue
            if end == len(sequenced_lst):
                contig_indices.append([start, end])  # append final contig indices as index list
        else:  # zero is found and contig broken
            if end - start > 1:
                if end - start < l:  # debug test... TODO: no window should be less than R length, L
                    print("MISTAKE")
                contig_indices.append([start, end])  # append contig indices as index list
            start = end  # start a new contig window
            end += 1
    num_contigs = len(contig_indices)
    avg_contig_len = get_average([i[1] - i[0] for i in contig_indices])

    return num_contigs, avg_contig_len

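If I've understood the goal correctly, your code is slow because every pass through the loop re-slices and re-scans the current window (the 0 not in sequenced_lst[start:end + 1] check), which puts it somewhere around O(n^2). Calling len on a list is cheap by comparison.
So instead of re-checking the window each time, take the difference between the index of the current 0 and the index of the previous 0.
That's fairly easy with enumerate: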

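def contigs_values(sequenced_lst, l):
    """
    Count nonzero streaks of length >= l by measuring the gaps between zeros.
    Returns (number of streaks, average streak length). Assumes l >= 1.
    """
    streaks = []

    prev = -1  # index of the most recent zero; -1 means "just before the list"
    for index, val in enumerate(sequenced_lst):
        if val == 0:
            length = index - prev - 1  # nonzero values strictly between two zeros
            if length >= l:
                streaks.append(length)
            prev = index

    # the last streak runs to the end of the list when the list does not end in 0
    length = len(sequenced_lst) - prev - 1
    if length >= l:
        streaks.append(length)

    num_contigs = len(streaks)
    # guard against dividing by zero when no streak is long enough
    # (returning 0 as the average in that case is an assumption)
    avg_contig_len = sum(streaks) / num_contigs if num_contigs else 0

    return num_contigs, avg_contig_len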
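
For comparison, the same single-pass idea can probably also be written with itertools.groupby, which splits the list into consecutive runs of zero and nonzero values. This is only a minimal sketch: the name streak_stats and the choice to return an average of 0.0 when no streak qualifies are my own assumptions, not something from the question.

```python
from itertools import groupby


def streak_stats(values, min_len):
    """Return (count, average length) of nonzero runs with length >= min_len."""
    # groupby yields consecutive runs that share the same key; here the key
    # is True for nonzero values and False for zeros.
    lengths = [
        sum(1 for _ in run)  # length of this run, consumed before moving on
        for is_nonzero, run in groupby(values, key=lambda v: v != 0)
        if is_nonzero
    ]
    streaks = [n for n in lengths if n >= min_len]
    if not streaks:
        # Assumption: report 0.0 as the average when nothing qualifies.
        return 0, 0.0
    return len(streaks), sum(streaks) / len(streaks)


print(streak_stats([1, 2, 0, 3, 4, 5, 0, 0, 6], 2))  # (2, 2.5)
```

Each element is visited once, so this should scale linearly with the length of the list, the same as the enumerate version.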