Optimizing parallel traversal of multiple lists in a function

Question

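I wrote a function that iterates over multiple lists of DNA sequences and plots their homology as the fraction of identical bases at each position, so that regions of high versus low homology across multiple species are easy to see.

To do this, I pass in a numpy array of m sequence lists read from an alignment file (clustal in this case):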

    [['-' '-' '-' ... 'C' 'C' 'T']
     ['-' '-' '-' ... '-' '-' '-']
     ['-' '-' '-' ... '-' '-' '-']
     ['G' 'C' 'T' ... '-' '-' '-']]

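I then iterate over each list and keep track of the values in each column. If a value in the column is A, T, C, or G, I count it in a dictionary; everything else (N or -) is ignored. I don't care which base is the most common. I just take the maximum count from the dictionary and store max/m-species in a separate list, using the following function: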

def createHists(SequenceArray):
    Conserved = []
    Ind = 0
    while Ind < len(SequenceArray[0]):
        NucCounts = {"A":0, "T":0, "C":0, "G":0}
        ColumnConservation = []
        for Seqs in SequenceArray:
            ColumnConservation.append(Seqs[Ind])
        for Nucs in ColumnConservation:
            if Nucs in NucCounts:
                NucCounts[Nucs] += 1
        if NucCounts[max(NucCounts, key=NucCounts.get)] > 1:
            ConservedN = NucCounts[max(NucCounts, key=NucCounts.get)]/len(ColumnConservation)
            Conserved.append(ConservedN)
        else:
            ConservedN = 0
            Conserved.append(ConservedN)
        Ind += 1
    return(Conserved)

The output is simple:

[0.75, 0.5, 0.75, 0.5, 0.5, 0.5, 0.75, 0.75, 0.5, 0.75, 0.5, 0.5, 0.75, 0.75]

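My question is: given that I iterate over every row, is there a way to make this faster? It's not that it isn't fast enough (my alignment file is currently 45,000 bp long), but I'd like to know whether a built-in library such as itertools can traverse multiple lists in parallel more efficiently. The main caveat is that the number of sequence lists is not known in advance: in this example it is 4, but there could be more.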

Answer 1

I think you may be overthinking this a bit.

from collections import Counter
import numpy as np

VALID_BASES = ['A', 'T', 'G', 'C']

# Array where rows are samples and columns are split from a string
arr = np.array([['-', '-', '-', 'C', 'C', 'T'],
     ['-', '-', '-', '-', '-', '-'],
     ['-', '-', '-', '-', '-', '-'],
     ['G', 'C', 'T', '-', '-', '-']])

Use Python's Counter on each column (each row of the transposed array):

counts = [Counter(x) for x in arr.T]
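To make it clearer what each entry of counts holds, here is a minimal sketch using only the first column of the example array above. The variable names are just for illustration. Note that a Counter returns 0 for a key it has never seen, which is why bases that do not occur in a column can be looked up safely.

```python
from collections import Counter

# First column of the example alignment: three gaps and one G
column = ['-', '-', '-', 'G']
count = Counter(column)

# Characters that never occur give 0 instead of raising KeyError
a_count = count['A']
g_count = count['G']
gap_count = count['-']
print(a_count, g_count, gap_count)
```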

Then take the largest proportion among your valid characters:

max_proportion = [max(count[v] / arr.shape[0] for v in VALID_BASES) for count in counts]

>>> max_proportion
[0.25, 0.25, 0.25, 0.25, 0.25, 0.25]
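If speed matters more as the alignment grows, the same calculation can probably be done with NumPy comparisons instead of a Python-level loop over columns. The following is only a sketch under my own assumptions: the function name max_base_proportion and its min_count parameter are hypothetical, not part of NumPy. With min_count=1 it gives the same values as the Counter version above; with min_count=2 it reports 0 for columns where no base occurs more than once, which is what the function in the question does.

```python
import numpy as np

VALID_BASES = ['A', 'T', 'G', 'C']

def max_base_proportion(arr, min_count=1):
    """Per-column fraction of rows that share the most common valid base.

    min_count is a hypothetical knob: columns whose highest base count
    is below it are reported as 0.
    """
    arr = np.asarray(arr)
    # One row of per-column counts for each base: shape (4, n_columns)
    base_counts = np.stack([(arr == base).sum(axis=0) for base in VALID_BASES])
    top = base_counts.max(axis=0)
    proportions = top / arr.shape[0]
    proportions[top < min_count] = 0.0
    return proportions

arr = np.array([['-', '-', '-', 'C', 'C', 'T'],
                ['-', '-', '-', '-', '-', '-'],
                ['-', '-', '-', '-', '-', '-'],
                ['G', 'C', 'T', '-', '-', '-']])

same_as_counter = max_base_proportion(arr)
like_question = max_base_proportion(arr, min_count=2)
```

This works for any number of sequences, since the number of rows is read from arr.shape[0] rather than being hard-coded.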

