广度优先链接爬取

Breadth first links crawling

假设我有一个网站:

现在使用时:

from bs4 import BeautifulSoup

def visit(url, recursion=0):
    links = getpage_and_retrieve_ahref_links(url)   # using beautifulsoup
    for link in links:
        if recursion < 10:         # limit recursion to 10 
            visit(link, recursion+1)

visit('http://example.com/home')

它会在访问 /page2 之前访问 /home、/page1、/page2368、/page999990、/page999991、...、/page999999。简而言之,它进行的是深度优先遍历,而不是(我想要的)广度优先遍历。

如何修改之前的代码进行广度优先访问,即所有visit先用calls recursion=1调用,然后visit用[=14调用=],等等?

应该依次访问/home、/page1、/page2、/page3、/page2368、/page41、/page999990等。

您可以使用双端队列(deque in Python)通过在 n 级链接之后追加 n+1 级链接来进行广度优先搜索。

from collections import deque

def bfs_visit(url, max_level=10):
    queue = deque([url])
    level = 0

    while queue and level < max_level:
        url = queue.popleft()
        visit_no_recur(url)  # only visits the page

        links = get_links(url)  # get links, maybe parse the result of last statement
        queue.append(links)
        level += 1 

bfs_visit('http://example.com/home')

在给定的示例中,队列将如下所示:

['/home']  
    => popleft /home (i.e. the next page to be visited is /home)
    => add new links on right
['/page1', '/page2', '/page3'] 
    => popleft /page1
    => add new links on right
['/page2', '/page3', '/page2368']
    => popleft /page2
    => add new links on right
['/page3', '/page2368', '/page41']
    => popleft /page3
['/page2368', '/page41']
    => popleft /page2368
    => add new links on right
['/page41', '/page999990', '/page999991', ..., '/page999999']
    ...