Levenshtein编辑距离如何通过traceback实现对齐

How to implement alignment through traceback for Levenshtein edit distance

所以我在Wikipedia and this Needleman教程的帮助下成功实现了Levenshtein(编辑最小距离)算法,其中自定义,插入和删除成本为1,而substitution/replacement成本为2.

可执行要点https://gist.github.com/axelmukwena/8696ec4ec72849d3cf384f5d97321407

import numpy as np

def word_edit_distance(x, y):
    rows = len(x) + 1
    cols = len(y) + 1
    distance = np.zeros((rows, cols), dtype=int)

    for i in range(1, rows):
        for k in range(1, cols):
            distance[i][0] = i
            distance[0][k] = k

    for col in range(1, cols):
        for row in range(1, rows):
            if x[row - 1] == y[col - 1]:
                cost = 0
            else:
                cost = 2
            distance[row][col] = min(distance[row - 1][col] + 1,
                                     distance[row][col - 1] + 1,
                                     distance[row - 1][col - 1] + cost)
     
    print(backtrace(x, y, distance))
    edit_distance = distance[row][col]
    return edit_distance, distance


result = word_edit_distance("AACGCA", "GAGCTA")
print(result[0])
print(result[1])
# output
4
[[0 1 2 3 4 5 6]
 [1 2 1 2 3 4 5]
 [2 3 2 3 4 5 4]
 [3 4 3 4 3 4 5]
 [4 3 4 3 4 5 6]
 [5 4 5 4 3 4 5]
 [6 5 4 5 4 5 4]]

And, I somehow also understand how to compute the backtracking, see my attempt below. However, there is a slight error. See Bottom

def backtrace(first, second, matrix):
    f = [char for char in first]
    s = [char for char in second]
    new_f, new_s = [], []
    row = len(f)
    col = len(s)
    trace = [[row, col]]

    while True:
        a = matrix[row - 1][col]
        b = matrix[row - 1][col - 1]
        c = matrix[row][col - 1]

        which = min(a, b, c)

        if which == matrix[row][col] or which == matrix[row][col] - 2:
            # when diagonal backtrace substitution or no substitution
            trace.append([row - 1, col - 1])
            new_f = [f[row - 1]] + new_f
            new_s = [s[col - 1]] + new_s

            row, col = row - 1, col - 1

        elif which == matrix[row][col] - 1:
            # either deletion or insertion, find if minimum is up or left
            if which == matrix[row - 1][col]:
                trace.append([row - 1, col])
                new_f = [f[row - 1]] + new_f
                new_s = ["-"] + new_s

                row, col = row - 1, col

            elif which == matrix[row][col - 1]:
                trace.append([row, col - 1])
                new_f = ["-"] + new_f
                new_s = [s[col - 1]] + new_s

                row, col = row, col - 1

        # Exit the loop
        if row == 0 or col == 0:
            return trace, new_f, new_s

结果

# trace => [[6, 6], [5, 5], [5, 4], [4, 3], [3, 2], [2, 2], [1, 2], [0, 1]]
['A', 'A', 'C', 'G', 'C', '-', 'A'], ['A', '-', '-', 'G', 'C', 'T', 'A']

预期结果:

# trace => [[6, 6], [5, 5], [5, 4], [4, 3], [3, 2], [2, 2], [1, 1], [0, 0]]
['A', 'A', 'C', 'G', 'C', '-', 'A'], ['A', 'A', '-', 'G', 'C', 'T', 'A']

Whats happening is:

  1. 在查找编辑距离期间,
# cost = 2
distance[row - 1][col] + 1 = 2         # orange
distance[row][col - 1] + 1 = 4         # yellow
distance[row - 1][col - 1] + cost = 2  # green 

所以在这里,橙色和绿色都是候选。但理想的候选人是绿色,因为 A == A

  1. 在回溯期间,我们没有关于序列的信息,只有矩阵中的点。所以跟踪将获得三个中最低的...
a = matrix[row - 1][col]      # a = 1
b = matrix[row - 1][col - 1]  # b = 2
c = matrix[row][col - 1]      # c = 3

which = min(a, b, c)          # which = a instead of b


我是否使用了正确的回溯算法?

不应该是min(a,b,c)。您应该 select 最小化另一个节点的分数加上从另一个节点到当前节点的操作成本的节点。

r = matrix[row][col]          # current node
a = matrix[row - 1][col]      # a = 1
b = matrix[row - 1][col - 1]  # b = 2
c = matrix[row][col - 1]      # c = 3

if x[row - 1] == y[col - 1]:
    cost = 0
else:
    cost = 2

if r == a + 1: return a
if r == b + cost: return b
if r == c + 1: return c

或更压缩的形式:

which = min(a + 1, b + cost, c + 1)

根据@tsanisl 的回答,这是我的综合实现:

def backtrace(first, second, matrix):
    f = [char for char in first]
    s = [char for char in second]
    new_f, new_s = [], []
    row = len(f)
    col = len(s)
    trace = [[row, col]]

    while True:
        if f[row - 1] == s[col - 1]:
            cost = 0
        else:
            cost = 2

        r = matrix[row][col]
        a = matrix[row - 1][col]
        b = matrix[row - 1][col - 1]
        c = matrix[row][col - 1]

        if r == b + cost:
            # when diagonal backtrace substitution or no substitution
            trace.append([row - 1, col - 1])
            new_f = [f[row - 1]] + new_f
            new_s = [s[col - 1]] + new_s

            row, col = row - 1, col - 1

        else:
            # either deletion or insertion, find if minimum is up or left
            if r == a + 1:
                trace.append([row - 1, col])
                new_f = [f[row - 1]] + new_f
                new_s = ["-"] + new_s

                row, col = row - 1, col

            elif r == c + 1:
                trace.append([row, col - 1])
                new_f = ["-"] + new_f
                new_s = [s[col - 1]] + new_s

                row, col = row, col - 1

        # Exit the loop
        if row == 0 or col == 0:
            return trace, new_f, new_s