Levenshtein编辑距离如何通过traceback实现对齐
How to implement alignment through traceback for Levenshtein edit distance
所以我在Wikipedia and this Needleman教程的帮助下成功实现了Levenshtein(编辑最小距离)算法,其中自定义,插入和删除成本为1,而substitution/replacement成本为2.
可执行要点https://gist.github.com/axelmukwena/8696ec4ec72849d3cf384f5d97321407
import numpy as np
def word_edit_distance(x, y):
rows = len(x) + 1
cols = len(y) + 1
distance = np.zeros((rows, cols), dtype=int)
for i in range(1, rows):
for k in range(1, cols):
distance[i][0] = i
distance[0][k] = k
for col in range(1, cols):
for row in range(1, rows):
if x[row - 1] == y[col - 1]:
cost = 0
else:
cost = 2
distance[row][col] = min(distance[row - 1][col] + 1,
distance[row][col - 1] + 1,
distance[row - 1][col - 1] + cost)
print(backtrace(x, y, distance))
edit_distance = distance[row][col]
return edit_distance, distance
result = word_edit_distance("AACGCA", "GAGCTA")
print(result[0])
print(result[1])
# output
4
[[0 1 2 3 4 5 6]
[1 2 1 2 3 4 5]
[2 3 2 3 4 5 4]
[3 4 3 4 3 4 5]
[4 3 4 3 4 5 6]
[5 4 5 4 3 4 5]
[6 5 4 5 4 5 4]]
And, I somehow also understand how to compute the backtracking, see my attempt below. However, there is a slight error. See Bottom
def backtrace(first, second, matrix):
f = [char for char in first]
s = [char for char in second]
new_f, new_s = [], []
row = len(f)
col = len(s)
trace = [[row, col]]
while True:
a = matrix[row - 1][col]
b = matrix[row - 1][col - 1]
c = matrix[row][col - 1]
which = min(a, b, c)
if which == matrix[row][col] or which == matrix[row][col] - 2:
# when diagonal backtrace substitution or no substitution
trace.append([row - 1, col - 1])
new_f = [f[row - 1]] + new_f
new_s = [s[col - 1]] + new_s
row, col = row - 1, col - 1
elif which == matrix[row][col] - 1:
# either deletion or insertion, find if minimum is up or left
if which == matrix[row - 1][col]:
trace.append([row - 1, col])
new_f = [f[row - 1]] + new_f
new_s = ["-"] + new_s
row, col = row - 1, col
elif which == matrix[row][col - 1]:
trace.append([row, col - 1])
new_f = ["-"] + new_f
new_s = [s[col - 1]] + new_s
row, col = row, col - 1
# Exit the loop
if row == 0 or col == 0:
return trace, new_f, new_s
结果
# trace => [[6, 6], [5, 5], [5, 4], [4, 3], [3, 2], [2, 2], [1, 2], [0, 1]]
['A', 'A', 'C', 'G', 'C', '-', 'A'], ['A', '-', '-', 'G', 'C', 'T', 'A']
预期结果:
# trace => [[6, 6], [5, 5], [5, 4], [4, 3], [3, 2], [2, 2], [1, 1], [0, 0]]
['A', 'A', 'C', 'G', 'C', '-', 'A'], ['A', 'A', '-', 'G', 'C', 'T', 'A']
Whats happening is:
- 在查找编辑距离期间,
# cost = 2
distance[row - 1][col] + 1 = 2 # orange
distance[row][col - 1] + 1 = 4 # yellow
distance[row - 1][col - 1] + cost = 2 # green
所以在这里,橙色和绿色都是候选。但理想的候选人是绿色,因为 A == A
- 在回溯期间,我们没有关于序列的信息,只有矩阵中的点。所以跟踪将获得三个中最低的...
a = matrix[row - 1][col] # a = 1
b = matrix[row - 1][col - 1] # b = 2
c = matrix[row][col - 1] # c = 3
which = min(a, b, c) # which = a instead of b
我是否使用了正确的回溯算法?
不应该是min(a,b,c)
。您应该 select 最小化另一个节点的分数加上从另一个节点到当前节点的操作成本的节点。
r = matrix[row][col] # current node
a = matrix[row - 1][col] # a = 1
b = matrix[row - 1][col - 1] # b = 2
c = matrix[row][col - 1] # c = 3
if x[row - 1] == y[col - 1]:
cost = 0
else:
cost = 2
if r == a + 1: return a
if r == b + cost: return b
if r == c + 1: return c
或更压缩的形式:
which = min(a + 1, b + cost, c + 1)
根据@tsanisl 的回答,这是我的综合实现:
def backtrace(first, second, matrix):
f = [char for char in first]
s = [char for char in second]
new_f, new_s = [], []
row = len(f)
col = len(s)
trace = [[row, col]]
while True:
if f[row - 1] == s[col - 1]:
cost = 0
else:
cost = 2
r = matrix[row][col]
a = matrix[row - 1][col]
b = matrix[row - 1][col - 1]
c = matrix[row][col - 1]
if r == b + cost:
# when diagonal backtrace substitution or no substitution
trace.append([row - 1, col - 1])
new_f = [f[row - 1]] + new_f
new_s = [s[col - 1]] + new_s
row, col = row - 1, col - 1
else:
# either deletion or insertion, find if minimum is up or left
if r == a + 1:
trace.append([row - 1, col])
new_f = [f[row - 1]] + new_f
new_s = ["-"] + new_s
row, col = row - 1, col
elif r == c + 1:
trace.append([row, col - 1])
new_f = ["-"] + new_f
new_s = [s[col - 1]] + new_s
row, col = row, col - 1
# Exit the loop
if row == 0 or col == 0:
return trace, new_f, new_s
所以我在Wikipedia and this Needleman教程的帮助下成功实现了Levenshtein(编辑最小距离)算法,其中自定义,插入和删除成本为1,而substitution/replacement成本为2.
可执行要点https://gist.github.com/axelmukwena/8696ec4ec72849d3cf384f5d97321407
import numpy as np
def word_edit_distance(x, y):
rows = len(x) + 1
cols = len(y) + 1
distance = np.zeros((rows, cols), dtype=int)
for i in range(1, rows):
for k in range(1, cols):
distance[i][0] = i
distance[0][k] = k
for col in range(1, cols):
for row in range(1, rows):
if x[row - 1] == y[col - 1]:
cost = 0
else:
cost = 2
distance[row][col] = min(distance[row - 1][col] + 1,
distance[row][col - 1] + 1,
distance[row - 1][col - 1] + cost)
print(backtrace(x, y, distance))
edit_distance = distance[row][col]
return edit_distance, distance
result = word_edit_distance("AACGCA", "GAGCTA")
print(result[0])
print(result[1])
# output
4
[[0 1 2 3 4 5 6]
[1 2 1 2 3 4 5]
[2 3 2 3 4 5 4]
[3 4 3 4 3 4 5]
[4 3 4 3 4 5 6]
[5 4 5 4 3 4 5]
[6 5 4 5 4 5 4]]
And, I somehow also understand how to compute the backtracking, see my attempt below. However, there is a slight error. See Bottom
def backtrace(first, second, matrix):
f = [char for char in first]
s = [char for char in second]
new_f, new_s = [], []
row = len(f)
col = len(s)
trace = [[row, col]]
while True:
a = matrix[row - 1][col]
b = matrix[row - 1][col - 1]
c = matrix[row][col - 1]
which = min(a, b, c)
if which == matrix[row][col] or which == matrix[row][col] - 2:
# when diagonal backtrace substitution or no substitution
trace.append([row - 1, col - 1])
new_f = [f[row - 1]] + new_f
new_s = [s[col - 1]] + new_s
row, col = row - 1, col - 1
elif which == matrix[row][col] - 1:
# either deletion or insertion, find if minimum is up or left
if which == matrix[row - 1][col]:
trace.append([row - 1, col])
new_f = [f[row - 1]] + new_f
new_s = ["-"] + new_s
row, col = row - 1, col
elif which == matrix[row][col - 1]:
trace.append([row, col - 1])
new_f = ["-"] + new_f
new_s = [s[col - 1]] + new_s
row, col = row, col - 1
# Exit the loop
if row == 0 or col == 0:
return trace, new_f, new_s
结果
# trace => [[6, 6], [5, 5], [5, 4], [4, 3], [3, 2], [2, 2], [1, 2], [0, 1]]
['A', 'A', 'C', 'G', 'C', '-', 'A'], ['A', '-', '-', 'G', 'C', 'T', 'A']
预期结果:
# trace => [[6, 6], [5, 5], [5, 4], [4, 3], [3, 2], [2, 2], [1, 1], [0, 0]]
['A', 'A', 'C', 'G', 'C', '-', 'A'], ['A', 'A', '-', 'G', 'C', 'T', 'A']
Whats happening is:
- 在查找编辑距离期间,
# cost = 2
distance[row - 1][col] + 1 = 2 # orange
distance[row][col - 1] + 1 = 4 # yellow
distance[row - 1][col - 1] + cost = 2 # green
所以在这里,橙色和绿色都是候选。但理想的候选人是绿色,因为 A == A
- 在回溯期间,我们没有关于序列的信息,只有矩阵中的点。所以跟踪将获得三个中最低的...
a = matrix[row - 1][col] # a = 1
b = matrix[row - 1][col - 1] # b = 2
c = matrix[row][col - 1] # c = 3
which = min(a, b, c) # which = a instead of b
我是否使用了正确的回溯算法?
不应该是min(a,b,c)
。您应该 select 最小化另一个节点的分数加上从另一个节点到当前节点的操作成本的节点。
r = matrix[row][col] # current node
a = matrix[row - 1][col] # a = 1
b = matrix[row - 1][col - 1] # b = 2
c = matrix[row][col - 1] # c = 3
if x[row - 1] == y[col - 1]:
cost = 0
else:
cost = 2
if r == a + 1: return a
if r == b + cost: return b
if r == c + 1: return c
或更压缩的形式:
which = min(a + 1, b + cost, c + 1)
根据@tsanisl 的回答,这是我的综合实现:
def backtrace(first, second, matrix):
f = [char for char in first]
s = [char for char in second]
new_f, new_s = [], []
row = len(f)
col = len(s)
trace = [[row, col]]
while True:
if f[row - 1] == s[col - 1]:
cost = 0
else:
cost = 2
r = matrix[row][col]
a = matrix[row - 1][col]
b = matrix[row - 1][col - 1]
c = matrix[row][col - 1]
if r == b + cost:
# when diagonal backtrace substitution or no substitution
trace.append([row - 1, col - 1])
new_f = [f[row - 1]] + new_f
new_s = [s[col - 1]] + new_s
row, col = row - 1, col - 1
else:
# either deletion or insertion, find if minimum is up or left
if r == a + 1:
trace.append([row - 1, col])
new_f = [f[row - 1]] + new_f
new_s = ["-"] + new_s
row, col = row - 1, col
elif r == c + 1:
trace.append([row, col - 1])
new_f = ["-"] + new_f
new_s = [s[col - 1]] + new_s
row, col = row, col - 1
# Exit the loop
if row == 0 or col == 0:
return trace, new_f, new_s