广度优先链接爬取
Breadth first links crawling
假设我有一个网站:
- http://example.com/home 链接到 /page1、/page2、/page3
- http://example.com/page1 链接到 /page2368
- http://example.com/page2 链接到 /page41
- http://example.com/page2368 链接到 /page999990、/page999991、...、/page999999
现在使用时:
from bs4 import BeautifulSoup
def visit(url, recursion=0):
links = getpage_and_retrieve_ahref_links(url) # using beautifulsoup
for link in links:
if recursion < 10: # limit recursion to 10
visit(link, recursion+1)
visit('http://example.com/home')
它会在访问 /page2 之前访问 /home、/page1、/page2368、/page999990、/page999991、...、/page999999。简而言之,它进行的是深度优先遍历,而不是(我想要的)广度优先遍历。
如何修改之前的代码进行广度优先访问,即所有visit
先用calls recursion=1
调用,然后visit
用[=14调用=],等等?
应该依次访问/home、/page1、/page2、/page3、/page2368、/page41、/page999990等。
您可以使用双端队列(deque
in Python)通过在 n 级链接之后追加 n+1 级链接来进行广度优先搜索。
from collections import deque
def bfs_visit(url, max_level=10):
queue = deque([url])
level = 0
while queue and level < max_level:
url = queue.popleft()
visit_no_recur(url) # only visits the page
links = get_links(url) # get links, maybe parse the result of last statement
queue.append(links)
level += 1
bfs_visit('http://example.com/home')
在给定的示例中,队列将如下所示:
['/home']
=> popleft /home (i.e. the next page to be visited is /home)
=> add new links on right
['/page1', '/page2', '/page3']
=> popleft /page1
=> add new links on right
['/page2', '/page3', '/page2368']
=> popleft /page2
=> add new links on right
['/page3', '/page2368', '/page41']
=> popleft /page3
['/page2368', '/page41']
=> popleft /page2368
=> add new links on right
['/page41', '/page999990', '/page999991', ..., '/page999999']
...
假设我有一个网站:
- http://example.com/home 链接到 /page1、/page2、/page3
- http://example.com/page1 链接到 /page2368
- http://example.com/page2 链接到 /page41
- http://example.com/page2368 链接到 /page999990、/page999991、...、/page999999
现在使用时:
from bs4 import BeautifulSoup
def visit(url, recursion=0):
links = getpage_and_retrieve_ahref_links(url) # using beautifulsoup
for link in links:
if recursion < 10: # limit recursion to 10
visit(link, recursion+1)
visit('http://example.com/home')
它会在访问 /page2 之前访问 /home、/page1、/page2368、/page999990、/page999991、...、/page999999。简而言之,它进行的是深度优先遍历,而不是(我想要的)广度优先遍历。
如何修改之前的代码进行广度优先访问,即所有visit
先用calls recursion=1
调用,然后visit
用[=14调用=],等等?
应该依次访问/home、/page1、/page2、/page3、/page2368、/page41、/page999990等。
您可以使用双端队列(deque
in Python)通过在 n 级链接之后追加 n+1 级链接来进行广度优先搜索。
from collections import deque
def bfs_visit(url, max_level=10):
queue = deque([url])
level = 0
while queue and level < max_level:
url = queue.popleft()
visit_no_recur(url) # only visits the page
links = get_links(url) # get links, maybe parse the result of last statement
queue.append(links)
level += 1
bfs_visit('http://example.com/home')
在给定的示例中,队列将如下所示:
['/home']
=> popleft /home (i.e. the next page to be visited is /home)
=> add new links on right
['/page1', '/page2', '/page3']
=> popleft /page1
=> add new links on right
['/page2', '/page3', '/page2368']
=> popleft /page2
=> add new links on right
['/page3', '/page2368', '/page41']
=> popleft /page3
['/page2368', '/page41']
=> popleft /page2368
=> add new links on right
['/page41', '/page999990', '/page999991', ..., '/page999999']
...