在 Python 中提高 BFS 的性能

Question

如何提高以下 Python 代码的速度性能？

我的代码工作正常，这意味着没有错误，但这段代码的性能非常慢。

输入数据为Facebook Large Page-Page Network数据集，您可以在这里访问数据集：(http://snap.stanford.edu/data/facebook-large-page-page-network.html)

问题定义：

检查两个节点之间的距离是否小于max_distance

我的约束：

我必须导入一个 .txt 文件，其格式类似于 sample_input
预期输出就像 sample_output
总代码运行时间应少于 5 秒。

任何人都可以给我建议以更好地改进我的代码吗？按照我的代码：

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                self.graph[key] = self.graph.get(key, []) + [val]

    def check_distance(self, x, y, max_distance):          
        dist = self.path(x, y, max_distance)
        if dist:
            return dist - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

提前感谢您的帮助。

Answer 1

几个要点：

如果您多次调用检查距离，则必须重新创建图形
调用 queue.pop(0) 在 python 中的标准列表上效率低下，请使用集合模块中的 deque 之类的东西。 see here
正如 DarrylG 指出的那样，一旦路径超过最大距离，您就可以提前退出 BFS

你可以试试

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = self.file_to_graph()

    def file_to_graph(self):
        graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                graph[key] = graph.get(key, []) + [val]
        return graph

    def check_distance(self, x, y, max_distance):          
        path_length = self.path(x, y, max_distance)
        if path_length:
            return len(path) - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                # we have explored all paths shorter than the max distance
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

关于为什么 pop(0) 效率低下 - 来自文档：

Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

Answer 2

关于方法：

您创建了一个图表，并在图表中对一个元素和另一个元素进行了多次比较。每次你运行你的 BFS 算法。这将每次产生 O|E+V| 的成本，或者，您需要一次又一次地计算每次距离。这不是一个好方法。

我推荐的。运行一种 Dijkstra 算法（获取 2 个节点之间的最小距离并将信息存储在邻接矩阵中。您需要做的只是获取此邻接矩阵中的计算信息，该邻接矩阵将包含您的所有最小距离图，你需要的是消耗上一步计算的距离

关于算法

我建议您寻找 DFS/BFS 的不同方法。

如果您要比较所有节点，我相信 Dijkstra 算法 在您的情况下会更有效，因为它们标记了已访问的路径。(https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm)。您只能修改和调用算法一次。

您需要检查的另一件事是。您的图表包含循环？如果是，您需要对循环应用一些控制，您需要检查 Ford Fulkerson Algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm)

据我了解。每次您想将一个节点与另一个节点进行比较时，您运行再次使用您的算法。如果你的图表上有 1000 个元素，你的比较，每次都会访问 999 个节点来检查这个。

如果您实施 Dijkstra 并存储距离，则您运行整个网络只需一次并将距离保存在内存中。

下一步是从内存中收集可以放入数组的距离。

您可以将所有距离保存在邻接矩阵 (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) 上，并且每次只使用几次信息而不会产生计算负担。

在 Python 中提高 BFS 的性能

Raising performance of BFS in Python

python

breadth-first-search

关于方法：

关于算法