Where to put BeautifulSoup code in Asyncio Web Scraping Application
I need to scrape the raw text of body paragraphs from a large number (5-10k per day) of news articles. I've written some threaded code, but given the highly I/O-bound nature of the project I am dabbling in asyncio. The snippet below is no faster than a 1-threaded version, and far worse than my threaded version. Could anyone tell me what I'm doing wrong? Thanks!
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text())
                     for para in body.find_all('p')]
            results.append(paras)
    return results
await means "wait until the result is ready", so when you await the fetch in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background task using something like asyncio.create_task(fetch(...)), and then await them, similar to how you'd do it with threads. Or, even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
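If you prefer the explicit asyncio.create_task approach mentioned above, a minimal sketch (untested) of the same scrape_urls would look like this:

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        # spawn every fetch into a background task first, so they all
        # run concurrently, and only then await them one by one
        tasks = [asyncio.create_task(fetch_and_parse(session, url))
                 for url in urls]
        return [await task for task in tasks]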
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of the HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. Since this version uses asyncio for IO and a thread for parsing, it should run about as fast as your threaded version, but it will scale to a much larger number of parallel downloads.
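For completeness, a minimal driver sketch for any of these variants (untested; the URLs are placeholders) would be:

urls = ['https://example.com/article-1',   # placeholder URLs
        'https://example.com/article-2']
all_paras = asyncio.run(scrape_urls(urls))

Also note that aiohttp's default connector caps a session at 100 concurrent connections, which is usually a reasonable bound when feeding in thousands of URLs at once.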
Finally, if you want the parsing to actually run in parallel, using multiple cores, you can use multi-processing instead:
import concurrent.futures

_pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras
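One caveat with the multi-processing version: parse must remain a module-level function so it can be pickled and sent to the worker processes, and on platforms that spawn workers the entry point should sit behind a main guard. A sketch (untested):

if __name__ == '__main__':
    # guard the entry point so spawned worker processes can
    # re-import this module without re-running the driver
    urls = ['https://example.com/article-1']   # placeholder
    print(asyncio.run(scrape_urls(urls)))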