Why aiohttp/asyncio stops running after a certain number of urls?

When I run the code below, if I slice the list of urls to 10 000, it stops working at url number 9983, but no error is shown in the terminal; the code simply stops running (as if it were frozen).

The same thing happens if I slice the list to 5000 urls: it stops running before reaching the 5000th url.

Strangely, if I slice the list to 1000 urls, the code works fine.

I really don't know where the problem comes from. My guess is that it has something to do with some aiohttp or asyncio parameter that I need to set somewhere to raise the number of requests allowed.

Here is my current code:

import asyncio
import time

import aiohttp
import pandas as pd


found = 0
not_found = 0
counter = 0

async def download_site(session, url):
    global found, not_found, counter
    async with session.get(url) as response:
        # response.url is the final URL after redirects; the target site
        # redirects missing pages to a dedicated "not found" address.
        if str(response.url) == 'https://fake.notfound.url.com':
            print('\n\n', response.url, '\n\n')
            not_found += 1
        else:
            found += 1
        counter += 1
        print(counter)


async def download_all_sites(sites):
    # Disable the total timeout so slow responses never raise TimeoutError.
    session_timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=session_timeout) as session:
        tasks = []
        for url in sites:
            task = asyncio.ensure_future(download_site(session, url))
            tasks.append(task)
        # Schedule every request at once -- nothing limits concurrency here.
        await asyncio.gather(*tasks)


if __name__ == "__main__":

    df = pd.read_csv('database_table.csv', sep=';', encoding='utf-8')

    sites = df['urls'].tolist()

    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(download_all_sites(sites[7000:17000]))
    duration = time.time() - start_time

    print(f'404: {not_found / len(sites[7000:17000]) * 100} %')
    print(f'200: {found / len(sites[7000:17000]) * 100} %')

After waiting a long time, I pressed Ctrl+C and got this traceback:

^CTraceback (most recent call last):
  File "/home/takamura/Documents/corp/scripts/misc_scripts/links_checker.py", line 74, in <module>
    
  File "/usr/lib/python3.9/asyncio/base_events.py", line 629, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.9/asyncio/base_events.py", line 596, in run_forever
    self._run_once()
  File "/usr/lib/python3.9/asyncio/base_events.py", line 1854, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib/python3.9/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <coroutine object download_all_sites at 0x7fc348b856c0>
RuntimeError: coroutine ignored GeneratorExit
Task was destroyed but it is pending!
task: <Task pending name='Task-609' coro=<download_site() running at /home/takamura/Documents/corp/scripts/misc_scripts/links_checker.py:41> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fc348101520>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.9/asyncio/tasks.py:766]>

What am I doing wrong?

Adding a semaphore like in this code, plus using

connector = aiohttp.TCPConnector(limit=80)
async with aiohttp.ClientSession(connector=connector) as session:
...

solved my problem.
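
For reference, here is a minimal, self-contained sketch of how the two fixes fit together; it is an illustration, not the exact script above. The semaphore bounds how many download coroutines hold a request open at once, while the connector's limit caps the number of simultaneous TCP connections. MAX_CONCURRENT, fetch, and the example url list are made-up names for this sketch.

import asyncio

import aiohttp

MAX_CONCURRENT = 80  # illustrative cap, matching the connector limit above

async def fetch(session, semaphore, url):
    # At most MAX_CONCURRENT coroutines pass this point at the same time.
    async with semaphore:
        async with session.get(url) as response:
            return response.status

async def download_all_sites(sites):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    # The connector additionally caps open sockets at the transport level.
    connector = aiohttp.TCPConnector(limit=80)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, semaphore, url) for url in sites]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    statuses = asyncio.run(download_all_sites(['https://example.com'] * 100))
    print(statuses.count(200))

With the concurrency bounded this way, the script no longer tries to open thousands of sockets at once, which is consistent with the original version only freezing on large slices.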