How do I use multiprocessing to extract links from webpages with Beautiful Soup?
I have a list of links, and for each link I create a Beautiful Soup object and scrape all of the links inside the page's paragraph tags. Because there are hundreds of links I want to scrape from, a single process takes far longer than I'd like, so multiprocessing seemed like the ideal solution.
Here is my code:
import requests
import traceback  # used in create_bsoup_object() below
from bs4 import BeautifulSoup
from multiprocessing import Process, Queue

urls = ['https://hbr.org/2011/05/the-case-for-executive-assistants', 'https://signalvnoise.com/posts/3450-when-culture-turns-into-policy']

def collect_links(urls):
    extracted_urls = []
    bsoup_objects = []
    p_tags = []  # store language between paragraph tags in each beautiful soup object
    workers = 4
    processes = []
    links = Queue()  # store links extracted from urls variable
    web_connection = Queue()  # store beautiful soup objects that are created for each url in urls variable

    # dump each url from urls variable into links Queue for all processes to use
    for url in urls:
        links.put(url)

    for w in xrange(workers):
        p = Process(target=create_bsoup_object, args=(links, web_connection))
        p.start()
        processes.append(p)
        links.put('STOP')

    for p in processes:
        p.join()

    web_connection.put('STOP')
    for beaut_soup_object in iter(web_connection.get, 'STOP'):
        p_tags.append(beaut_soup_object.find_all('p'))

    for paragraphs in p_tags:
        bsoup_objects.append(BeautifulSoup(str(paragraphs)))

    for beautiful_soup_object in bsoup_objects:
        for link_tag in beautiful_soup_object.find_all('a'):
            extracted_urls.append(link_tag.get('href'))

    return extracted_urls

def create_bsoup_object(links, web_connection):
    for link in iter(links.get, 'STOP'):
        try:
            web_connection.put(BeautifulSoup(requests.get(link, timeout=3.05).content))
        except requests.exceptions.Timeout as e:
            # client couldn't connect to server or return data in time period specified in timeout parameter in requests.get()
            pass
        except requests.exceptions.ConnectionError as e:
            # in case of faulty url
            pass
        except Exception, err:
            # catch regular errors
            print(traceback.format_exc())
            pass
        except requests.exceptions.HTTPError as e:
            pass
    return True
When I run collect_links(urls), instead of getting a list of links back I get an empty list, along with the following error:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
send(obj)
RuntimeError: maximum recursion depth exceeded while calling a Python object
[]
I'm not sure what that refers to. I read somewhere that queues are best suited to simple objects. Does the size of the Beautiful Soup objects I'm storing in them have anything to do with this? I'd appreciate any insight.
The objects you place on the queue need to be picklable. For example:
import pickle
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://httpbin.org').text)
print type(soup)
p = pickle.dumps(soup)
This code raises RuntimeError: maximum recursion depth exceeded while calling a Python object.
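A plain HTML string, on the other hand, pickles without trouble; here is a quick check against the same page:

import pickle
import requests

html = requests.get('http://httpbin.org').text
p = pickle.dumps(html)  # succeeds: an ordinary string is trivially picklable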
Instead, you can put the actual HTML text on the queue and then pass it through BeautifulSoup in the main process. This will still improve performance, because your application is likely to be I/O bound due to its network component.
In create_bsoup_object(), do this:
web_connection.put(requests.get(link, timeout=3.05).text)
This puts the HTML, rather than a BeautifulSoup object, on the queue. You can then parse the HTML in the main process, for example as sketched below.
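A minimal sketch of how the tail end of collect_links() might then look (it assumes web_connection now carries HTML strings, and uses the built-in html.parser; adjust the parser to taste):

    web_connection.put('STOP')
    for html in iter(web_connection.get, 'STOP'):
        soup = BeautifulSoup(html, 'html.parser')  # parsing now happens in the main process
        for paragraph in soup.find_all('p'):
            for link_tag in paragraph.find_all('a'):
                extracted_urls.append(link_tag.get('href'))
    return extracted_urls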
Alternatively, parse the pages and extract the URLs in the child processes, and put extracted_urls on the queue.
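A sketch of that second variant, using a hypothetical worker called extract_links() that sends back plain lists of hrefs (lists of strings pickle with no trouble). The separate except clauses are collapsed into requests.exceptions.RequestException, which is the base class of Timeout, ConnectionError and HTTPError:

def extract_links(links, web_connection):
    # fetch each URL, parse it here in the worker, and queue only the extracted hrefs
    for link in iter(links.get, 'STOP'):
        try:
            soup = BeautifulSoup(requests.get(link, timeout=3.05).content, 'html.parser')
            hrefs = [a.get('href')
                     for paragraph in soup.find_all('p')
                     for a in paragraph.find_all('a')]
            web_connection.put(hrefs)
        except requests.exceptions.RequestException:
            # covers timeouts, faulty urls and HTTP errors
            pass
    return True

The main process would then simply extend extracted_urls with each list it pulls from web_connection, instead of doing any parsing itself.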