Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape the raw text of the body paragraphs of many (5-10k per day) news articles. I've written some threaded code, but given the highly I/O-bound nature of the project, I'm dabbling in asyncio. The snippet below is no faster than a single-threaded version, and far worse than my threaded version. Can anyone tell me what I'm doing wrong? Thanks!

import asyncio

import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text())
                     for para in body.find_all('p')]
            results.append(paras)
    return results

await means "wait until the result is ready", so when you await the fetch in each loop iteration, you request (and get) sequential execution. To parallelize the fetching, you need to spawn each fetch with asyncio.create_task(fetch(...)) and then await the resulting tasks, similar to how you'd do it with threads; a sketch of that variant follows the example below. Or, even more simply, you can let a convenience function like asyncio.gather spawn each fetch into a background task for you. For example (untested):

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
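
For reference, the asyncio.create_task variant mentioned above spawns all the tasks first and awaits them afterwards, so the downloads overlap. A minimal sketch, also untested:

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        # spawn every fetch-and-parse as a background task first...
        tasks = [asyncio.create_task(fetch_and_parse(session, url))
                 for url in urls]
        # ...and only then await them, so they all run concurrently
        return [await task for task in tasks]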

If you find that this still runs slower than the multi-threaded version, it may be that the parsing of the HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can use run_in_executor to move the parsing into a separate thread:

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras

Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" once the background thread has completed the given assignment. Since this version uses asyncio for the IO and threads for the parsing, it should run about as fast as your threaded version, but it will scale to a much larger number of parallel downloads.
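
If you also want to cap the number of parser threads rather than rely on the default executor (passing None selects the default), you can hand run_in_executor an explicit ThreadPoolExecutor. A minimal sketch, with max_workers chosen arbitrarily:

import concurrent.futures

_thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in one of the pool's worker threads
    paras = await loop.run_in_executor(_thread_pool, parse, html)
    return paras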

Finally, if you want the parsing to run truly in parallel, using multiple cores, you can switch to multiprocessing instead:

import concurrent.futures

_pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras
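
Keep in mind that with ProcessPoolExecutor, parse must be defined at module top level so the worker processes can import it, and on platforms that spawn workers (such as Windows) the entry point should be guarded with an __main__ check. A hypothetical driver, assuming Python 3.7+ for asyncio.run and with example_urls standing in for your real URL list:

if __name__ == '__main__':
    example_urls = ['https://example.com/article-1',
                    'https://example.com/article-2']
    all_paras = asyncio.run(scrape_urls(example_urls))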