Can't use https proxies along with reusing the same session within a script built upon asyncio
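I'm trying to use https proxies in asynchronous requests built on the asyncio library. For http proxies there is a clear explanation here, but I'm stuck when it comes to https proxies. I would also like to reuse the same session instead of creating a new one every time a request is sent.
This is what I've tried so far (the proxies used in the script are taken directly from a free proxy site, so treat them as placeholders):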
import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies, proxy_url
    while True:
        check_url = proxy_url
        # the list entries already include the http:// scheme
        proxy = proxy_url
        print("trying using:", check_url)
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, proxy=proxy, ssl=False) as resp:
                    return await resp.text()
            except Exception:
                # switch proxies only if no other task has already done so
                if check_url == proxy_url:
                    proxy_url = proxies.pop()

async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text, 'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()
How can I use https proxies in the script and reuse the same session?
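This script creates a dictionary, proxy_session_map, whose keys are proxies and whose values are sessions, so we always know which session belongs to which proxy.
If an error occurs while using a proxy, I add that proxy to the disabled_proxies set so that it is not used again: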
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()
proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            # pick a random proxy that has not failed yet; None means a direct request
            available_proxies = [p for p in proxies if p not in disabled_proxies]
            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            # create one session per proxy on first use, then reuse it
            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=5))

            print("trying using:", proxy)

            async with proxy_session_map[proxy].get(url, proxy=proxy, ssl=False) as resp:
                return await resp.text()

        except Exception:
            if proxy:
                print("error, disabling:", proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''

async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text, 'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]

    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())
Prints (for example):
trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrape instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?
... and so on.
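As a possible variation on the script above, the pick-a-proxy and disable-on-failure logic can be separated from the HTTP call. The sketch below is not part of the original answer: ProxyPool, fetch_with_retry, fake_fetch and the example addresses are hypothetical names made up for illustration, and the request itself is replaced by a stand-in coroutine so the snippet runs without network access. Unlike the script above, it gives up with an empty string once every proxy has failed instead of trying a direct request.

```python
import asyncio
import random


class ProxyPool:
    """Hypothetical helper that remembers which proxies have failed."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.disabled = set()

    def pick(self):
        """Return a random proxy that has not failed yet, or None if none is left."""
        available = [p for p in self.proxies if p not in self.disabled]
        return random.choice(available) if available else None

    def disable(self, proxy):
        self.disabled.add(proxy)


async def fetch_with_retry(fetch, pool, url):
    """Call fetch(url, proxy) with proxies from the pool until one succeeds.

    `fetch` is any coroutine function taking (url, proxy). It is an
    assumption of this sketch, not an aiohttp API.
    """
    while True:
        proxy = pool.pick()
        if proxy is None:
            return ''  # every proxy has failed, so give up with an empty string
        try:
            return await fetch(url, proxy)
        except Exception:
            pool.disable(proxy)


async def demo():
    # Stand-in for a real request: only one made-up proxy "works".
    async def fake_fetch(url, proxy):
        if proxy != 'http://good.example:8080':
            raise ConnectionError(proxy)
        return f'body of {url}'

    pool = ProxyPool(['http://bad.example:8080', 'http://good.example:8080'])
    urls = ['https://example.com/1', 'https://example.com/2']
    results = await asyncio.gather(
        *(fetch_with_retry(fake_fetch, pool, u) for u in urls)
    )
    return results, pool.disabled


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

With aiohttp, fetch would be a small coroutine wrapping session.get(url, proxy=proxy, ssl=False). Since proxy is passed per request there, a single ClientSession could probably be shared across all proxies instead of keeping one session per proxy; treat that as an assumption that has not been checked against real https proxies.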