Algorithm to construct DeBruijn graph gives wrong results
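I am trying to write some Python code that builds a DeBruijn graph from a set of kmers (k-letter strings, DNA sequencing reads). The output should be a collection of edges, listing for each node the nodes it connects to.
When I run my code on the sample input: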
['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
I get:
CAG -> AGG
GAG -> AGG
instead of:
AGG -> GGG
CAG -> AGG,AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
Any hints as to what I am doing wrong?
Here is the code:
import itertools
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
y=[a[1:] for a in inp]
z=[b[:len(b)-1] for b in inp]
y.extend(z)
edjes=list(set(y))
w=[c[1:] for c in edjes]
v=[d[:len(d)-1] for d in edjes]
w.extend(v)
nodes=list(set(w))
graph={}
new=itertools.product(edjes,edjes)
for node in nodes:
    for edj in new:
        edje1,edje2=edj[0],edj[1]
        if edje1[1:]==node and edje2[:len(edje2)-1]==node:
            if edje1 in graph:
                graph[edje1].append(edje2)
            else:
                graph[edje1]=[edje2]
for val in graph.values():
    val.sort()
for k,v in sorted(graph.items()):
    if len(v)<1:
        continue
    else:
        line=k+' -> '+','.join(v)+'\n'
        print (line)
I think your algorithm is far too complicated. You can start by simply de-duplicating the input:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
edges=list(set(inp))
Then iterate over this list of edges. For each edge, the first three characters are the "from" node and the last three characters are the "to" node:
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    #...
Now you only need to add each edge to the graph:
graph={}
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    if frm in graph:
        graph[frm].append(to)
    else:
        graph[frm]=[to]
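The if/else bookkeeping in this step can also be written with `collections.defaultdict`, which creates the empty list on first access. This is only a sketch of the same idea; the function name `build_graph` is made up for illustration and does not appear in the original code.

```python
from collections import defaultdict

def build_graph(edges):
    """Map each (k-1)-character prefix to the list of suffixes it leads to."""
    graph = defaultdict(list)
    for edge in edges:
        frm = edge[:-1]   # all but the last character
        to = edge[1:]     # all but the first character
        graph[frm].append(to)
    return graph

# De-duplicated sample input from the question
graph = build_graph({'GAGG', 'CAGG', 'GGGG', 'GGGA', 'AGGG', 'GGAG'})
print(sorted(graph['GGG']))
```

The slice `edge[:-1]` is equivalent to `edge[:len(edge)-1]` for the non-empty strings used here.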
Finally, sort and print, as you already did yourself:
for val in graph.values():
    val.sort()
for k,v in sorted(graph.items()):
    print(k+' -> '+','.join(v))
This results in:
AGG -> GGG
CAG -> AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
As you can see, line 2 is slightly different: your expected output contains AGG twice, which does not make much sense.
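So the full algorithm reads:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
# each distinct 4-mer is one edge
edges=list(set(inp))
graph={}
for edge in edges:
    frm = edge[:len(edge)-1]  # first three characters: "from" node
    to = edge[1:]             # last three characters: "to" node
    if frm in graph:
        graph[frm].append(to)
    else:
        graph[frm]=[to]
# sort the neighbours and the nodes for stable output
for val in graph.values():
    val.sort()
for k,v in sorted(graph.items()):
    print(k+' -> '+','.join(v))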
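If the repeated AGG in the expected output is intentional (an assumption: each occurrence of a kmer contributes its own edge, so CAGG appearing twice yields two CAG -> AGG edges), then skipping the de-duplication step reproduces it. The sketch below wraps both behaviours in one hypothetical function, `de_bruijn_lines`, with an illustrative `unique` flag; neither name comes from the original post.

```python
def de_bruijn_lines(kmers, unique=True):
    """Return adjacency lines 'FROM -> TO1,TO2' for the given k-mers.

    unique=True collapses repeated k-mers into a single edge;
    unique=False keeps one edge per occurrence (a multigraph).
    """
    edges = set(kmers) if unique else list(kmers)
    graph = {}
    for edge in edges:
        graph.setdefault(edge[:-1], []).append(edge[1:])
    return [k + ' -> ' + ','.join(sorted(v)) for k, v in sorted(graph.items())]

inp = ['GAGG', 'CAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']
for line in de_bruijn_lines(inp, unique=False):
    print(line)
# AGG -> GGG
# CAG -> AGG,AGG
# GAG -> AGG
# GGA -> GAG
# GGG -> GGA,GGG
```

Which of the two behaviours is wanted depends on how the task defines the graph, so it is worth checking the problem statement before choosing.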
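Your algorithm
One problem, I think, is that you treat the three-letter sequences as "edjes" (probably edges). The edges are the four-character sequences, so information is lost in this conversion. Next you construct a set of two-character items (nodes, which are not nodes at all). They seem to be used to "glue" the nodes together, but at that stage you no longer know how the pieces were originally joined.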
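A further detail, not raised in the answer above and offered only as a possible contributing factor: `itertools.product` returns a one-shot iterator. In the question's code, `new` would therefore be fully consumed while handling the first entry of `nodes`, leaving nothing to loop over for the remaining ones, which would be consistent with only two lines being printed. A minimal demonstration, with illustrative variable names:

```python
import itertools

pairs = itertools.product(['CAG', 'AGG'], repeat=2)

first_pass = list(pairs)    # consumes all four pairs
second_pass = list(pairs)   # the iterator is already exhausted

print(len(first_pass), len(second_pass))  # 4 0
```

Materialising the pairs once with `list(...)`, or calling `itertools.product` inside the outer loop, avoids the exhaustion. It does not address the information loss described above, so building the graph directly from the four-character edges remains the simpler approach.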