Read websites with parallel requests?
I want to read the HTML content of several websites at the same time and tried the following code, which usually works fine:
import asyncio
import timeit

import requests

# websites: list of URLs to fetch, defined elsewhere
resultText = {}
start = timeit.default_timer()

async def main():
    loop = asyncio.get_event_loop()
    # run the blocking requests.get calls in the default thread pool executor
    futures = [
        loop.run_in_executor(
            None,
            requests.get,
            websites[i]
        )
        for i in range(22)
    ]
    for i, response in enumerate(await asyncio.gather(*futures)):
        resultText[websites[i]] = response.text

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
stop = timeit.default_timer()

print(f"Time for whole process: {round((stop - start) / 60, 2)} min")
for k, v in resultText.items():
    print(k, len(v))
print(len(resultText))
But it only seems to work for 22 sites. (When I change the for loop from 22 to, say, 23, it stops with the following error:)
Traceback (most recent call last):
File "C:\DEV\Fiverr\TRY\robalf\checkPages2.py", line 67, in <module>
loop.run_until_complete(main())
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 642, in run_until_complete
return future.result()
File "C:\DEV\Fiverr\TRY\robalf\checkPages2.py", line 62, in main
for i, response in enumerate(await asyncio.gather(*futures)):
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\concurrent\futures\thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'Eine vorhandene Verbindung wurde vom Remotehost geschlossen', None, 10054, None))
How can I read more than 22 websites? (It doesn't have to be all x sites in parallel; for me it would be enough to read the first 22 in parallel, then the next 22, and so on.) But when I tried to loop the async workflow, I ran into the same error as above.
You can use httpx instead:
import httpx

async def get_stock_price_data(stock):
    # use the client as a context manager so its connections are closed properly
    async with httpx.AsyncClient() as client:
        stock_page = await client.get(f'https://finance.yahoo.com/quote/{stock}')
        return stock_page.text
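After an `import asyncio`, you can call it with, for example, `asyncio.run(get_stock_price_data('TSLA'))`.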
Here is a full article describing await/async in detail: https://pythonhowtoprogram.com/python-await-async-tutorial-with-real-examples-and-simple-explanations/
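Since you only need the pages in batches of 22 anyway, here is a minimal sketch of how the result dictionary from your code could be built with httpx, fetching the URLs chunk by chunk. The chunk size of 22 and the `websites` list come from the question; the `fetch` helper and the other names are illustrative, not part of any library:

import asyncio

import httpx

async def fetch(client, url):
    # fetch one URL and return it together with the body text
    response = await client.get(url)
    return url, response.text

async def main(websites, chunk_size=22):
    resultText = {}
    # one shared client, so connections are pooled and reused across chunks
    async with httpx.AsyncClient() as client:
        for start in range(0, len(websites), chunk_size):
            chunk = websites[start:start + chunk_size]
            # fetch the current chunk of URLs concurrently
            for url, text in await asyncio.gather(*(fetch(client, u) for u in chunk)):
                resultText[url] = text
    return resultText

# resultText = asyncio.run(main(websites))

Only one chunk is in flight at a time, which mirrors the batching you describe and keeps the number of simultaneous connections to the remote hosts bounded.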