Why does aiohttp/asyncio stop running after a certain number of URLs?
When I run the code below with a slice of 10,000 URLs, it stops working at URL number 9983, but no error is shown in the terminal; the code just stops running (as if frozen).
Same behavior if I slice the list to 5,000 URLs: it stops running before reaching the 5,000th URL.
Strangely, if I slice the list to 1,000 URLs, the code works fine.
I really don't know where the problem is. I suspect it has something to do with an aiohttp or asyncio parameter I need to set somewhere to increase the number of allowed requests.
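To illustrate what I mean: a per-request timeout (instead of the `total=None` I use below) should at least turn a silent hang into a visible error. Here is a network-free sketch with `asyncio.wait_for`, where `slow_op` is just a made-up stand-in for a request that never returns:

```python
import asyncio

# slow_op stands in for a request that never completes; the 10 s sleep
# and the 0.05 s timeout are arbitrary illustrative values.
async def slow_op():
    await asyncio.sleep(10)

async def main():
    try:
        # wait_for cancels slow_op and raises if it exceeds the timeout,
        # instead of blocking the whole run forever.
        await asyncio.wait_for(slow_op(), timeout=0.05)
        return "ok"
    except asyncio.TimeoutError:
        return "timed out"

result = asyncio.run(main())
print(result)
```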
Here is my current code:
import asyncio
import time

import aiohttp
import pandas as pd

found = 0
not_found = 0
counter = 0

async def download_site(session, url):
    global found, not_found, counter
    async with session.get(url) as response:
        if str(response.url) == 'https://fake.notfound.url.com':
            print('\n\n', response.url, '\n\n')
            not_found += 1
        else:
            found += 1
        counter += 1
        print(counter)

async def download_all_sites(sites):
    session_timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=session_timeout) as session:
        tasks = []
        for url in sites:
            task = asyncio.ensure_future(download_site(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    df = pd.read_csv('database_table.csv', sep=';', encoding='utf-8')
    sites = df['urls'].tolist()

    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(download_all_sites(sites[7000:17000]))
    duration = time.time() - start_time

    print(f'404: {not_found / len(sites[7000:17000]) * 100} %')
    print(f'200: {found / len(sites[7000:17000]) * 100} %')
After waiting a long time, I press Ctrl+C and get this traceback:
^CTraceback (most recent call last):
File "/home/takamura/Documents/corp/scripts/misc_scripts/links_checker.py", line 74, in <module>
File "/usr/lib/python3.9/asyncio/base_events.py", line 629, in run_until_complete
self.run_forever()
File "/usr/lib/python3.9/asyncio/base_events.py", line 596, in run_forever
self._run_once()
File "/usr/lib/python3.9/asyncio/base_events.py", line 1854, in _run_once
event_list = self._selector.select(timeout)
File "/usr/lib/python3.9/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <coroutine object download_all_sites at 0x7fc348b856c0>
RuntimeError: coroutine ignored GeneratorExit
Task was destroyed but it is pending!
task: <Task pending name='Task-609' coro=<download_site() running at /home/takamura/Documents/corp/scripts/misc_scripts/links_checker.py:41> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fc348101520>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.9/asyncio/tasks.py:766]>
What am I doing wrong?
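While debugging, I also sketched the concurrency cap I suspect is missing, using `asyncio.Semaphore` with a fake coroutine instead of a real HTTP call. `fake_request` and `MAX_CONCURRENT` are made-up names just to check the pattern, not part of my real script:

```python
import asyncio

MAX_CONCURRENT = 5  # hypothetical cap; would be tuned for the real workload
peak = 0            # highest number of simultaneously active "requests" seen
active = 0

async def fake_request(sem, i):
    global peak, active
    async with sem:  # at most MAX_CONCURRENT tasks get past this point at once
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for the HTTP round trip
        active -= 1
        return i

async def run_all(n):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # gather preserves input order even though tasks finish concurrently
    return await asyncio.gather(*(fake_request(sem, i) for i in range(n)))

results = asyncio.run(run_all(50))
print(peak)  # never exceeds MAX_CONCURRENT
```

If this is the right idea, the same semaphore would wrap the `session.get(url)` call in `download_site`, so that only a bounded number of connections are open at any moment.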