Where to put BeautifulSoup code in Asyncio Web Scraping Application
I need to scrape the raw text of body paragraphs from a large number (5-10k per day) of news articles. I've written some threaded code, but given the highly I/O-bound nature of the project I am dabbling in asyncio. The snippet below is no faster than a 1-threaded version, and far worse than my threaded version. Could anyone tell me what I'm doing wrong? Thanks!
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text())
                     for para in body.find_all('p')]
            results.append(paras)
    return results
await means "wait until the result is ready", so when you await the fetch in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background task using something like asyncio.create_task(fetch(...)), and then await them, similar to how you'd do it with threads. Or, even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
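If you prefer the explicit asyncio.create_task approach mentioned above, a minimal sketch (untested) of the same scrape_urls would look like this:

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        # spawn every fetch into a background task first, so they all
        # run concurrently, and only then await them one by one
        tasks = [asyncio.create_task(fetch_and_parse(session, url))
                 for url in urls]
        return [await task for task in tasks]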
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of the HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. Since this version uses asyncio for IO and a thread for parsing, it should run about as fast as your threaded version, but it will scale to a much larger number of parallel downloads.
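For completeness, a minimal driver sketch for any of these variants (untested; the URLs are placeholders) would be:

urls = ['https://example.com/article-1',   # placeholder URLs
        'https://example.com/article-2']
all_paras = asyncio.run(scrape_urls(urls))

Also note that aiohttp's default connector caps a session at 100 concurrent connections, which is usually a reasonable bound when feeding in thousands of URLs at once.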
Finally, if you want the parsing to actually run in parallel, using multiple cores, you can use multi-processing instead:
import concurrent.futures

_pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras
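One caveat with the multi-processing version: parse must remain a module-level function so it can be pickled and sent to the worker processes, and on platforms that spawn workers the entry point should sit behind a main guard. A sketch (untested):

if __name__ == '__main__':
    # guard the entry point so spawned worker processes can
    # re-import this module without re-running the driver
    urls = ['https://example.com/article-1']   # placeholder
    print(asyncio.run(scrape_urls(urls)))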