维基百科页面上的 BFS 需要很长时间 - 有人可以帮我分析我的代码的运行时间吗？

Question

我正在尝试在维基百科页面上执行 BFS。我相信我正在以运行时间明智的最佳方式正确地实现这一点（将其保持在一个线程中），但是要花很长时间才能找到两篇文章之间的联系。这是我的实现：

marked = set()
queue = deque()
count = 0

def get_wikipedia_page(wiki_link):
    url = BASE_URL + wiki_link
    time.sleep(0.1)
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

def backtrace(parent, start, end):
    boolean = False
    ret = []
    while boolean == False:
        if end in parent:
            ret.append(parent[end])
            end = parent[end]
            if end == start:
                break;
    ret.reverse()
    return ret

def bfs(first_wiki_link, second_wiki_link):
    print('Searching starting from ' + first_wiki_link + ' looking for ' + second_wiki_link + '...')
    parent = {}
    queue.append(first_wiki_link)
    start_time = time.time()
    #current_parent = first_wiki_link
    while queue:
        link = queue.pop()
        current_parent = link
        link_list = list(filter(lambda c: c.get('href') != None and c.get('href').startswith('/wiki/') and c.get('href') != '/wiki/Main_Page' and ':' not in c.get('href'), get_wikipedia_page(link).findAll('a')))
        for link in link_list:
            href = link.get('href')
            if not marked.__contains__(href):
                parent[href] = current_parent
                marked.add(href)
                queue.append(href)
            if href == second_wiki_link:
                ret = backtrace(parent, first_wiki_link, second_wiki_link)
                print("connection found")
                end_time = time.time()
                print("found a connection in: " + str(end_time - start_time) + " seconds.")
                print("the path is " + str(len(ret)) + " pages long")
                print(ret)
                return True

有时需要几分钟才能找到匹配项。由于维基百科有多大，这是预期的吗？还是我在这里搞砸了一些东西并且以非最佳方式执行它？或者，会不会是美汤运行宁太慢了？

Answer 1

您正在有效地执行 dfs，而不是 bfs。像维基百科内容这样的大型图表中的 dfs 可能需要很长时间才能查看节点的近邻，并且很可能会导致更长时间才能找到连接，因为这两个页面很可能以某种方式相关，因此关闭。

这部分代码：

    while queue:
        link = queue.pop() # <---- look at this
        current_parent = link
        link_list = list(filter...
        for link in link_list:

这使它成为一个 dfs，因为您正在像堆栈一样使用 queue。 bfs 需要一个 FIFO（先进先出）数据结构，而你做的是 LIFO（后进先出），这就是 dfs 所做的。

这意味着在您查看了维基百科上的每一页之后，第一页的第一个相邻页面将被查看。

求解：

    while queue:
        link = queue.popleft() # <---- look at this
        current_parent = link
        link_list = list(filter...
        for link in link_list:

这将首先查看邻居，然后按照您的预期进行广度优先搜索。

维基百科页面上的 BFS 需要很长时间 - 有人可以帮我分析我的代码的运行时间吗？

BFS on wikipedia pages is taking very long - Can someone help me analyze my code's runtime?

python

algorithm

graph-theory

breadth-first-search