在 python 中实现用于局部对齐的 Smith-Waterman 算法

Implementing Smith-Waterman algorithm for local alignment in python

我创建了一个序列比对工具来比较两条 DNA 链(X 和 Y),以找到 X 和 Y 的子串的最佳比对。算法总结如下(https://en.wikipedia.org/wiki/Smith–Waterman_algorithm)。我已经能够生成一个列表列表,用零填充它们,以表示我的矩阵。我创建了一个评分算法,以 return 对碱基之间的每种比对进行数值评分(例如,匹配加 4)。然后我创建了一个对齐算法,应该在我的“矩阵”的每个坐标中放置一个分数。但是,当我去打印矩阵时,它只 return 是全零的原始值(而不是实际分数)。

我知道还有其他方法可以实现此方法(例如使用 numpy),那么您能告诉我为什么这个特定代码(下面)不起作用吗?有没有办法修改它,使其正常工作?

代码:

def zeros(X: int, Y: int):
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)
    
    def score(X, Y):
        if X[n] == Y[m]: return 4
        if X[n] == '-' or Y[m] == '-': return -4
        else: return -2

    def SmithWaterman(X, Y, score):
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
        matrix[n, m] = max(align, indelX, indelY, 0)
    print(matrix)

zeros("ACGT", "ACGT")

输出:

[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

它只是打印出归零矩阵的原因是 SmithWaterman 函数从未被调用,所以矩阵永远不会更新。

你需要做类似的事情

# ...
SmithWaterman(X, Y, score)
print(matrix)
# ...

但是,如果您这样做,您会发现这段代码实际上在许多其他方面都存在问题。我已经完成并注释了一些语法错误和代码的其他问题:

def zeros(X: int, Y: int):
    #        ^       ^  incorrect type annotations. should be str
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)
    # A more "pythonic" way of expressing the above would be:
    # matrix = [[0] * len(Y) + 1 for _ in range(len(x) + 1)]
    
    def score(X, Y):
        #     ^  ^ shadowing variables from outer scope. this is not a bug per se but it's considered bad practice
        if X[n] == Y[m]: return 4
        #    ^       ^  variables not defined in scope
        if X[n] == '-' or Y[m] == '-': return -4
        #    ^              ^  variables not defined in scope
        else: return -2

    def SmithWaterman(X, Y, score): # this function is never called
        #                   ^ unnecessary function passed as parameter. function is defined in scope
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                #                 ^ invalid list lookup. should be: matrix[n-1][m-1]
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                #                                          ^ out of bounds error when m == len(Y)
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
                #                                  ^ out of bounds error when n == len(X)
        matrix[n, m] = max(align, indelX, indelY, 0)
        # this should be nested in the inner for-loop. m, n, indelX, and indelY are not defined in scope here
    print(matrix)

zeros("ACGT", "ACGT")