使用字典数据结构在有向图中查找具有至少 3 个节点的所有循环
Find all cycles with at least 3 nodes in a directed graph using dictionary data structure
上图是使用 LaTeX 绘制的:https://www.overleaf.com/read/rxhpghzbkhby
上图在Python.
中表示为字典
graph = {
'A' : ['B','D', 'C'],
'B' : ['C'],
'C' : [],
'D' : ['E'],
'E' : ['G'],
'F' : ['A', 'I'],
'G' : ['A', 'K'],
'H' : ['F', 'G'],
'I' : ['H'],
'J' : ['A'],
'K' : []
}
我有一个大约有 3,378,546 个节点的大图。
鉴于上述有向图,我试图找到具有至少 3 个且少于 5 个不同节点的圆,并输出前 3 个圆。
这个问题我花了1天半的时间。我查看了 Whosebug,甚至尝试按照此 Detect Cycle in a Directed Graph 教程进行操作,但无法找到解决方案。
在此示例中,输出是一个制表符分隔的文本文件,其中每行都有一个循环。
0 A, D, E, G
1 F, I, H
0
和 1
是索引。
此外,图节点的字母表中没有顺序。
我试过这种形式How to implement depth-first search in Python教程:
visited = set()
def dfs(visited, graph, node):
if node not in visited:
print (node)
visited.add(node)
for neighbour in graph[node]:
dfs(visited, graph, neighbour)
dfs(visited, graph, 'A')
但这无济于事。我也试过这个 Post
这是一个注释代码,它会打印包含找到的循环的数组。我认为将 return 值调整为所需的格式(我认为在你的情况下为 CSV)不需要更多。
可能是因为有 3M 个节点,结果很慢。然后我会建议采用动态编程方式和 caching/memoize 一些递归的结果,以免重复它们。
我希望这能解决您的问题或至少有所帮助。
def cycles_rec(root, current_node, graph, depth, visited, min_depth, max_depth):
depth += 1
# First part our stop conditions
if current_node in visited or current_node not in graph.keys():
return ''
if depth >= max_depth:
return ''
visited.append(current_node)
if root in graph[current_node] and depth >= min_depth:
return current_node
# The recursive part
# for each connection we try to find recursively one that would cycle back to our root
for connections in graph[current_node]:
for connection in connections:
result = cycles_rec(root, connection, graph, depth, visited, min_depth, max_depth)
# If a match was found, it would "bubble up" here, we can return it along with the
# current connection that "found it"
if result != '':
return current_node + ' ' + result
# If we are here we found no cycle
return ''
def cycles(graph, min_depth = 3, max_depth = 5):
cycles = {}
for node, connections in graph.items():
for connection in connections:
visited = []
# Let the recursion begin here
result = cycles_rec(node, connection, graph, 1, visited, min_depth, max_depth)
if result == '':
continue
# Here we found a cycle.
# Fingerprint is only necessary in order to not repeat the cycles found in the results
# It could be ignored if repeating them is not important
# It's based on the fact that nodes are all represented as letters here
# It could be it's own function returning a hash for example if nodes have a more
# complex representation
fingerprint = ''.join(sorted(list(node + ' ' + result)))
if fingerprint not in cycles.keys():
cycles[fingerprint] = node + ' ' + result
return list(cycles.values())
因此,假设您在示例中声明的图形变量:
print(cycles(graph, 3, 5))
会打印出来
['A D E G', 'F I H']
注意:此解决方案是对描述解决方案的扩展解决方案。我扩展到具有约 300 万个节点的原始图,我寻找至少 3 个节点且少于 40 个节点的所有循环,并将前 3 个循环存储到一个文件中。
我想到了以下解决方案。
# implementation of Johnson's cycle finding algorithm
# Original paper: Donald B Johnson. "Finding all the elementary circuits of a directed graph." SIAM Journal on Computing. 1975.
from collections import defaultdict
import networkx as nx
from networkx.utils import not_implemented_for, pairwise
@not_implemented_for("undirected")
def findCycles(G):
"""Find simple cycles of a directed graph.
A `simple cycle` is a closed path where no node appears twice.
Two elementary circuits are distinct if they are not cyclic permutations of each other.
This is iterator/generator version of Johnson's algorithm [1]_.
There may be better algorithms for some cases [2]_ [3]_.
Parameters
----------
G : NetworkX DiGraph
A directed graph
Returns
-------
cycle_generator: generator
A generator that produces elementary cycles of the graph.
Each cycle is represented by a list of nodes along the cycle.
Examples
--------
>>> graph = {'A' : ['B','D', 'C'],
'B' : ['C'],
'C' : [],
'D' : ['E'],
'E' : ['G'],
'F' : ['A', 'I'],
'G' : ['A', 'K'],
'H' : ['F', 'G'],
'I' : ['H'],
'J' : ['A'],
'K' : []
}
>>> G = nx.DiGraph()
>>> G.add_nodes_from(graph.keys())
>>> for keys, values in graph.items():
G.add_edges_from(([(keys, node) for node in values]))
>>> list(nx.findCycles(G))
[['F', 'I', 'H'], ['G', 'A', 'D', 'E']]
Notes
-----
The implementation follows pp. 79-80 in [1]_.
The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
elementary circuits.
References
----------
.. [1] Finding all the elementary circuits of a directed graph.
D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
https://doi.org/10.1137/0204007
.. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
.. [3] A search strategy for the elementary cycles of a directed graph.
J.L. Szwarcfiter and P.E. Lauer, BIT NUMERICAL MATHEMATICS,
v. 16, no. 2, 192-204, 1976.
--------
"""
def _unblock(thisnode, blocked, B):
stack = {thisnode}
while stack:
node = stack.pop()
if node in blocked:
blocked.remove(node)
stack.update(B[node])
B[node].clear()
# Johnson's algorithm requires some ordering of the nodes.
# We assign the arbitrary ordering given by the strongly connected comps
# There is no need to track the ordering as each node removed as processed.
# Also we save the actual graph so we can mutate it. We only take the
# edges because we do not want to copy edge and node attributes here.
subG = type(G)(G.edges())
sccs = [scc for scc in nx.strongly_connected_components(subG) if len(scc) in list(range(3, 41))]
# Johnson's algorithm exclude self cycle edges like (v, v)
# To be backward compatible, we record those cycles in advance
# and then remove from subG
for v in subG:
if subG.has_edge(v, v):
yield [v]
subG.remove_edge(v, v)
while sccs:
scc = sccs.pop()
sccG = subG.subgraph(scc)
# order of scc determines ordering of nodes
startnode = scc.pop()
# Processing node runs "circuit" routine from recursive version
path = [startnode]
blocked = set() # vertex: blocked from search?
closed = set() # nodes involved in a cycle
blocked.add(startnode)
B = defaultdict(set) # graph portions that yield no elementary circuit
stack = [(startnode, list(sccG[startnode]))] # sccG gives comp nbrs
while stack:
thisnode, nbrs = stack[-1]
if nbrs:
nextnode = nbrs.pop()
if nextnode == startnode:
yield path[:]
closed.update(path)
# print "Found a cycle", path, closed
elif nextnode not in blocked:
path.append(nextnode)
stack.append((nextnode, list(sccG[nextnode])))
closed.discard(nextnode)
blocked.add(nextnode)
continue
# done with nextnode... look for more neighbors
if not nbrs: # no more nbrs
if thisnode in closed:
_unblock(thisnode, blocked, B)
else:
for nbr in sccG[thisnode]:
if thisnode not in B[nbr]:
B[nbr].add(thisnode)
stack.pop()
path.pop()
# done processing this node
H = subG.subgraph(scc) # make smaller to avoid work in SCC routine
sccs.extend(scc for scc in nx.strongly_connected_components(H) if len(scc) in list(range(3, 41)))
import sys, csv, json
def findAllCycles(jsonInputFile, textOutFile):
"""Find simple cycles of a directed graph (jsonInputFile).
Parameters:
----------
jsonInputFile: a json file that has all concepts
textOutFile: give a desired name of output file
Returns:
----------
a .text file (named: {textOutFile}.txt) has the first 3 cycles found in jsonInputFile
Each cycle is represented by a list of nodes along the cycle
"""
with open(jsonInputFile) as infile:
graph = json.load(infile)
# Convert the json file to a NetworkX directed graph
G = nx.DiGraph()
G.add_nodes_from(graph.keys())
for keys, values in graph.items():
G.add_edges_from(([(keys, node) for node in values]))
# Search for all simple cycles existed in the graph
_cycles = list(findCycles(G))
# Start with an empty list and populate it by looping over all cycles
# in _cycles that have at least 3 and less than 40 different concepts (nodes)
cycles = []
for cycle in _cycles:
if len(cycle) in list(range(3, 41)):
cycles.append(cycle)
# Store the cycels under constraint in {textOutFile}.txt
with open(textOutFile, 'w') as outfile:
for cycle in cycles[:3]:
outfile.write(','.join(n for n in cycle)+'\n')
outfile.close()
# When process finishes, print Done!!
return 'Done!!'
infile = sys.argv[1]
outfile = sys.argv[2]
first_cycles = findAllCycles(infile, outfile)
要运行这个程序,你只需使用命令行如下:
>> python3 {program file name}.py graph.json {desired output file name}[.txt][.csv]
让,例如{所需的输出文件名}}.[txt][.csv],为first_3_cycles_found.txt
在我的例子中,该图有 3,378,546 个节点,我花了大约 40 分钟的时间使用上述代码找到所有循环。因此,输出文件将是:
如果您认为它需要任何改进或添加其他内容,请为此做出贡献。
上图是使用 LaTeX 绘制的:https://www.overleaf.com/read/rxhpghzbkhby
上图在Python.
中表示为字典graph = {
'A' : ['B','D', 'C'],
'B' : ['C'],
'C' : [],
'D' : ['E'],
'E' : ['G'],
'F' : ['A', 'I'],
'G' : ['A', 'K'],
'H' : ['F', 'G'],
'I' : ['H'],
'J' : ['A'],
'K' : []
}
我有一个大约有 3,378,546 个节点的大图。
鉴于上述有向图,我试图找到具有至少 3 个且少于 5 个不同节点的圆,并输出前 3 个圆。
这个问题我花了1天半的时间。我查看了 Whosebug,甚至尝试按照此 Detect Cycle in a Directed Graph 教程进行操作,但无法找到解决方案。
在此示例中,输出是一个制表符分隔的文本文件,其中每行都有一个循环。
0 A, D, E, G
1 F, I, H
0
和 1
是索引。
此外,图节点的字母表中没有顺序。
我试过这种形式How to implement depth-first search in Python教程:
visited = set()
def dfs(visited, graph, node):
if node not in visited:
print (node)
visited.add(node)
for neighbour in graph[node]:
dfs(visited, graph, neighbour)
dfs(visited, graph, 'A')
但这无济于事。我也试过这个 Post
这是一个注释代码,它会打印包含找到的循环的数组。我认为将 return 值调整为所需的格式(我认为在你的情况下为 CSV)不需要更多。
可能是因为有 3M 个节点,结果很慢。然后我会建议采用动态编程方式和 caching/memoize 一些递归的结果,以免重复它们。
我希望这能解决您的问题或至少有所帮助。
def cycles_rec(root, current_node, graph, depth, visited, min_depth, max_depth):
depth += 1
# First part our stop conditions
if current_node in visited or current_node not in graph.keys():
return ''
if depth >= max_depth:
return ''
visited.append(current_node)
if root in graph[current_node] and depth >= min_depth:
return current_node
# The recursive part
# for each connection we try to find recursively one that would cycle back to our root
for connections in graph[current_node]:
for connection in connections:
result = cycles_rec(root, connection, graph, depth, visited, min_depth, max_depth)
# If a match was found, it would "bubble up" here, we can return it along with the
# current connection that "found it"
if result != '':
return current_node + ' ' + result
# If we are here we found no cycle
return ''
def cycles(graph, min_depth = 3, max_depth = 5):
cycles = {}
for node, connections in graph.items():
for connection in connections:
visited = []
# Let the recursion begin here
result = cycles_rec(node, connection, graph, 1, visited, min_depth, max_depth)
if result == '':
continue
# Here we found a cycle.
# Fingerprint is only necessary in order to not repeat the cycles found in the results
# It could be ignored if repeating them is not important
# It's based on the fact that nodes are all represented as letters here
# It could be it's own function returning a hash for example if nodes have a more
# complex representation
fingerprint = ''.join(sorted(list(node + ' ' + result)))
if fingerprint not in cycles.keys():
cycles[fingerprint] = node + ' ' + result
return list(cycles.values())
因此,假设您在示例中声明的图形变量:
print(cycles(graph, 3, 5))
会打印出来
['A D E G', 'F I H']
注意:此解决方案是对描述解决方案的扩展解决方案。我扩展到具有约 300 万个节点的原始图,我寻找至少 3 个节点且少于 40 个节点的所有循环,并将前 3 个循环存储到一个文件中。
我想到了以下解决方案。
# implementation of Johnson's cycle finding algorithm
# Original paper: Donald B Johnson. "Finding all the elementary circuits of a directed graph." SIAM Journal on Computing. 1975.
from collections import defaultdict
import networkx as nx
from networkx.utils import not_implemented_for, pairwise
@not_implemented_for("undirected")
def findCycles(G):
"""Find simple cycles of a directed graph.
A `simple cycle` is a closed path where no node appears twice.
Two elementary circuits are distinct if they are not cyclic permutations of each other.
This is iterator/generator version of Johnson's algorithm [1]_.
There may be better algorithms for some cases [2]_ [3]_.
Parameters
----------
G : NetworkX DiGraph
A directed graph
Returns
-------
cycle_generator: generator
A generator that produces elementary cycles of the graph.
Each cycle is represented by a list of nodes along the cycle.
Examples
--------
>>> graph = {'A' : ['B','D', 'C'],
'B' : ['C'],
'C' : [],
'D' : ['E'],
'E' : ['G'],
'F' : ['A', 'I'],
'G' : ['A', 'K'],
'H' : ['F', 'G'],
'I' : ['H'],
'J' : ['A'],
'K' : []
}
>>> G = nx.DiGraph()
>>> G.add_nodes_from(graph.keys())
>>> for keys, values in graph.items():
G.add_edges_from(([(keys, node) for node in values]))
>>> list(nx.findCycles(G))
[['F', 'I', 'H'], ['G', 'A', 'D', 'E']]
Notes
-----
The implementation follows pp. 79-80 in [1]_.
The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
elementary circuits.
References
----------
.. [1] Finding all the elementary circuits of a directed graph.
D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
https://doi.org/10.1137/0204007
.. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
.. [3] A search strategy for the elementary cycles of a directed graph.
J.L. Szwarcfiter and P.E. Lauer, BIT NUMERICAL MATHEMATICS,
v. 16, no. 2, 192-204, 1976.
--------
"""
def _unblock(thisnode, blocked, B):
stack = {thisnode}
while stack:
node = stack.pop()
if node in blocked:
blocked.remove(node)
stack.update(B[node])
B[node].clear()
# Johnson's algorithm requires some ordering of the nodes.
# We assign the arbitrary ordering given by the strongly connected comps
# There is no need to track the ordering as each node removed as processed.
# Also we save the actual graph so we can mutate it. We only take the
# edges because we do not want to copy edge and node attributes here.
subG = type(G)(G.edges())
sccs = [scc for scc in nx.strongly_connected_components(subG) if len(scc) in list(range(3, 41))]
# Johnson's algorithm exclude self cycle edges like (v, v)
# To be backward compatible, we record those cycles in advance
# and then remove from subG
for v in subG:
if subG.has_edge(v, v):
yield [v]
subG.remove_edge(v, v)
while sccs:
scc = sccs.pop()
sccG = subG.subgraph(scc)
# order of scc determines ordering of nodes
startnode = scc.pop()
# Processing node runs "circuit" routine from recursive version
path = [startnode]
blocked = set() # vertex: blocked from search?
closed = set() # nodes involved in a cycle
blocked.add(startnode)
B = defaultdict(set) # graph portions that yield no elementary circuit
stack = [(startnode, list(sccG[startnode]))] # sccG gives comp nbrs
while stack:
thisnode, nbrs = stack[-1]
if nbrs:
nextnode = nbrs.pop()
if nextnode == startnode:
yield path[:]
closed.update(path)
# print "Found a cycle", path, closed
elif nextnode not in blocked:
path.append(nextnode)
stack.append((nextnode, list(sccG[nextnode])))
closed.discard(nextnode)
blocked.add(nextnode)
continue
# done with nextnode... look for more neighbors
if not nbrs: # no more nbrs
if thisnode in closed:
_unblock(thisnode, blocked, B)
else:
for nbr in sccG[thisnode]:
if thisnode not in B[nbr]:
B[nbr].add(thisnode)
stack.pop()
path.pop()
# done processing this node
H = subG.subgraph(scc) # make smaller to avoid work in SCC routine
sccs.extend(scc for scc in nx.strongly_connected_components(H) if len(scc) in list(range(3, 41)))
import sys, csv, json
def findAllCycles(jsonInputFile, textOutFile):
"""Find simple cycles of a directed graph (jsonInputFile).
Parameters:
----------
jsonInputFile: a json file that has all concepts
textOutFile: give a desired name of output file
Returns:
----------
a .text file (named: {textOutFile}.txt) has the first 3 cycles found in jsonInputFile
Each cycle is represented by a list of nodes along the cycle
"""
with open(jsonInputFile) as infile:
graph = json.load(infile)
# Convert the json file to a NetworkX directed graph
G = nx.DiGraph()
G.add_nodes_from(graph.keys())
for keys, values in graph.items():
G.add_edges_from(([(keys, node) for node in values]))
# Search for all simple cycles existed in the graph
_cycles = list(findCycles(G))
# Start with an empty list and populate it by looping over all cycles
# in _cycles that have at least 3 and less than 40 different concepts (nodes)
cycles = []
for cycle in _cycles:
if len(cycle) in list(range(3, 41)):
cycles.append(cycle)
# Store the cycels under constraint in {textOutFile}.txt
with open(textOutFile, 'w') as outfile:
for cycle in cycles[:3]:
outfile.write(','.join(n for n in cycle)+'\n')
outfile.close()
# When process finishes, print Done!!
return 'Done!!'
infile = sys.argv[1]
outfile = sys.argv[2]
first_cycles = findAllCycles(infile, outfile)
要运行这个程序,你只需使用命令行如下:
>> python3 {program file name}.py graph.json {desired output file name}[.txt][.csv]
让,例如{所需的输出文件名}}.[txt][.csv],为first_3_cycles_found.txt
在我的例子中,该图有 3,378,546 个节点,我花了大约 40 分钟的时间使用上述代码找到所有循环。因此,输出文件将是:
如果您认为它需要任何改进或添加其他内容,请为此做出贡献。