使用字典数据结构在有向图中查找具有至少 3 个节点的所有循环

Find all cycles with at least 3 nodes in a directed graph using dictionary data structure

上图是使用 LaTeX 绘制的:https://www.overleaf.com/read/rxhpghzbkhby

上图在Python.

中表示为字典
graph = {
    'A' : ['B','D', 'C'],
    'B' : ['C'],
    'C' : [],
    'D' : ['E'],
    'E' : ['G'],
    'F' : ['A', 'I'],
    'G' : ['A', 'K'],
    'H' : ['F', 'G'],
    'I' : ['H'],
    'J' : ['A'],
    'K' : []
}

我有一个大约有 3,378,546 个节点的大图。

鉴于上述有向图,我试图找到具有至少 3 个且少于 5 个不同节点的圆,并输出前 3 个圆。

这个问题我花了1天半的时间。我查看了 Whosebug,甚至尝试按照此 Detect Cycle in a Directed Graph 教程进行操作,但无法找到解决方案。

在此示例中,输出是一个制表符分隔的文本文件,其中每行都有一个循环。

0 A, D, E, G
1 F, I, H

01 是索引。 此外,图节点的字母表中没有顺序。

我试过这种形式How to implement depth-first search in Python教程:

visited = set()

def dfs(visited, graph, node):
    if node not in visited:
        print (node)
        visited.add(node)
        for neighbour in graph[node]:
            dfs(visited, graph, neighbour)

dfs(visited, graph, 'A')

但这无济于事。我也试过这个 Post

这是一个注释代码,它会打印包含找到的循环的数组。我认为将 return 值调整为所需的格式(我认为在你的情况下为 CSV)不需要更多。

可能是因为有 3M 个节点,结果很慢。然后我会建议采用动态编程方式和 caching/memoize 一些递归的结果,以免重复它们。

我希望这能解决您的问题或至少有所帮助。

def cycles_rec(root, current_node, graph, depth, visited, min_depth, max_depth):
    depth += 1

    # First part our stop conditions
    if current_node in visited or current_node not in graph.keys():
        return ''

    if depth >= max_depth:
        return ''

    visited.append(current_node)

    if root in graph[current_node] and depth >= min_depth:
        return current_node

    # The recursive part
    # for each connection we try to find recursively one that would cycle back to our root
    for connections in graph[current_node]:
        for connection in connections:
            result = cycles_rec(root, connection, graph, depth, visited, min_depth, max_depth)
            # If a match was found, it would "bubble up" here, we can return it along with the
            # current connection that "found it"
            if result != '':
                return current_node + ' ' + result

    # If we are here we found no cycle        
    return ''

def cycles(graph, min_depth = 3, max_depth = 5):
    cycles = {}
    for node, connections in graph.items():
        for connection in connections:
            visited = []
            # Let the recursion begin here
            result = cycles_rec(node, connection, graph, 1, visited, min_depth, max_depth)
            if result == '':
                continue 
            # Here we found a cycle.
            # Fingerprint is only necessary in order to not repeat the cycles found in the results
            # It could be ignored if repeating them is not important
            # It's based on the fact that nodes are all represented as letters here
            # It could be it's own function returning a hash for example if nodes have a more
            # complex representation
            fingerprint = ''.join(sorted(list(node + ' ' + result)))
            if fingerprint not in cycles.keys():
                cycles[fingerprint] = node + ' ' + result

    return list(cycles.values())

因此,假设您在示例中声明的图形变量:

print(cycles(graph, 3, 5))

会打印出来

['A D E G', 'F I H']

注意:此解决方案是对描述解决方案的扩展解决方案。我扩展到具有约 300 万个节点的原始图,我寻找至少 3 个节点且少于 40 个节点的所有循环,并将前 3 个循环存储到一个文件中。


我想到了以下解决方案。

# implementation of Johnson's cycle finding algorithm
# Original paper: Donald B Johnson. "Finding all the elementary circuits of a directed graph." SIAM Journal on Computing. 1975.

from collections import defaultdict

import networkx as nx
from networkx.utils import not_implemented_for, pairwise

@not_implemented_for("undirected")
def findCycles(G):
    """Find simple cycles of a directed graph.
    A `simple cycle` is a closed path where no node appears twice.
    Two elementary circuits are distinct if they are not cyclic permutations of each other.
    This is iterator/generator version of Johnson's algorithm [1]_.
    There may be better algorithms for some cases [2]_ [3]_.
    Parameters
    ----------
    G : NetworkX DiGraph
       A directed graph
    Returns
    -------
    cycle_generator: generator
       A generator that produces elementary cycles of the graph.
       Each cycle is represented by a list of nodes along the cycle.
    Examples
    --------
    >>> graph = {'A' : ['B','D', 'C'],
                 'B' : ['C'],
                 'C' : [],
                 'D' : ['E'],
                 'E' : ['G'],
                 'F' : ['A', 'I'],
                 'G' : ['A', 'K'],
                 'H' : ['F', 'G'],
                 'I' : ['H'],
                 'J' : ['A'],
                 'K' : []
                }
    >>> G = nx.DiGraph()
    >>> G.add_nodes_from(graph.keys())
    >>> for keys, values in graph.items():
            G.add_edges_from(([(keys, node) for node in values]))
    >>> list(nx.findCycles(G))
    [['F', 'I', 'H'], ['G', 'A', 'D', 'E']]
    Notes
    -----
    The implementation follows pp. 79-80 in [1]_.
    The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
    elementary circuits.
    References
    ----------
    .. [1] Finding all the elementary circuits of a directed graph.
       D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
       https://doi.org/10.1137/0204007
    .. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
       G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
    .. [3] A search strategy for the elementary cycles of a directed graph.
       J.L. Szwarcfiter and P.E. Lauer, BIT NUMERICAL MATHEMATICS,
       v. 16, no. 2, 192-204, 1976.
    --------
    """

    def _unblock(thisnode, blocked, B):
        stack = {thisnode}
        while stack:
            node = stack.pop()
            if node in blocked:
                blocked.remove(node)
                stack.update(B[node])
                B[node].clear()

    # Johnson's algorithm requires some ordering of the nodes.
    # We assign the arbitrary ordering given by the strongly connected comps
    # There is no need to track the ordering as each node removed as processed.
    # Also we save the actual graph so we can mutate it. We only take the
    # edges because we do not want to copy edge and node attributes here.
    subG = type(G)(G.edges())
    sccs = [scc for scc in nx.strongly_connected_components(subG) if len(scc) in list(range(3, 41))]

    # Johnson's algorithm exclude self cycle edges like (v, v)
    # To be backward compatible, we record those cycles in advance
    # and then remove from subG
    for v in subG:
        if subG.has_edge(v, v):
            yield [v]
            subG.remove_edge(v, v)

    while sccs:
        scc = sccs.pop()
        sccG = subG.subgraph(scc)
        # order of scc determines ordering of nodes
        startnode = scc.pop()
        # Processing node runs "circuit" routine from recursive version
        path = [startnode]
        blocked = set()  # vertex: blocked from search?
        closed = set()  # nodes involved in a cycle
        blocked.add(startnode)
        B = defaultdict(set)  # graph portions that yield no elementary circuit
        stack = [(startnode, list(sccG[startnode]))]  # sccG gives comp nbrs
        while stack:
            thisnode, nbrs = stack[-1]
            if nbrs:
                nextnode = nbrs.pop()
                if nextnode == startnode:
                    yield path[:]
                    closed.update(path)
                #                        print "Found a cycle", path, closed
                elif nextnode not in blocked:
                    path.append(nextnode)
                    stack.append((nextnode, list(sccG[nextnode])))
                    closed.discard(nextnode)
                    blocked.add(nextnode)
                    continue
            # done with nextnode... look for more neighbors
            if not nbrs:  # no more nbrs
                if thisnode in closed:
                    _unblock(thisnode, blocked, B)
                else:
                    for nbr in sccG[thisnode]:
                        if thisnode not in B[nbr]:
                            B[nbr].add(thisnode)
                stack.pop()
                path.pop()
        # done processing this node
        H = subG.subgraph(scc)  # make smaller to avoid work in SCC routine
        sccs.extend(scc for scc in nx.strongly_connected_components(H) if len(scc) in list(range(3, 41)))
import sys, csv, json

def findAllCycles(jsonInputFile, textOutFile):
    """Find simple cycles of a directed graph (jsonInputFile).
    Parameters:
    ----------
        jsonInputFile: a json file that has all concepts
        textOutFile: give a desired name of output file
    Returns:
    ----------
        a .text file (named: {textOutFile}.txt) has the first 3 cycles found in jsonInputFile
        Each cycle is represented by a list of nodes along the cycle
    """

    with open(jsonInputFile) as infile:
        graph = json.load(infile)
    # Convert the json file to a NetworkX directed graph
    G = nx.DiGraph()
    G.add_nodes_from(graph.keys())
    for keys, values in graph.items():
        G.add_edges_from(([(keys, node) for node in values]))
    # Search for all simple cycles existed in the graph
    _cycles = list(findCycles(G))
    # Start with an empty list and populate it by looping over all cycles
    # in _cycles that have at least 3 and less than 40 different concepts (nodes)
    cycles = []
    for cycle in _cycles:
        if len(cycle) in list(range(3, 41)):
            cycles.append(cycle)
    # Store the cycels under constraint in {textOutFile}.txt
    with open(textOutFile, 'w') as outfile:
        for cycle in cycles[:3]:
            outfile.write(','.join(n for n in cycle)+'\n')
        outfile.close()
    # When process finishes, print Done!!
    return 'Done!!'


infile = sys.argv[1]
outfile = sys.argv[2]
first_cycles = findAllCycles(infile, outfile)

要运行这个程序,你只需使用命令行如下:

>> python3 {program file name}.py graph.json {desired output file name}[.txt][.csv]

让,例如{所需的输出文件名}}.[txt][.csv],为first_3_cycles_found.txt

在我的例子中,该图有 3,378,546 个节点,我花了大约 40 分钟的时间使用上述代码找到所有循环。因此,输出文件将是:

如果您认为它需要任何改进或添加其他内容,请为此做出贡献。