Can't use https proxies along with reusing the same session within a script built upon asyncio

Question

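I'm trying to use https proxies in asynchronous requests built on the asyncio library. There is clear guidance for http proxies, but I'm stuck when it comes to https proxies. I would also like to reuse the same session rather than create a new one every time I send a request.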

This is what I've tried so far (the proxies used in the script are taken directly from a free proxy site, so treat them as placeholders):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies,proxy_url
    while True:
        check_url = proxy_url
        proxy = proxy_url  # entries in proxies already include the http:// scheme
        print("trying using:",check_url)
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url,proxy=proxy,ssl=False) as resp:
                    return await resp.text()
            except Exception:
                if check_url == proxy_url:
                    proxy_url = proxies.pop()

async def field_info(field_link):              
    text = await get_text(field_link)          
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()

How can I use https proxies in the script and reuse the same session?

Answer 1

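This script creates a dictionary, proxy_session_map, whose keys are proxies and whose values are sessions. That way we know which session belongs to which proxy.

If an error occurs while using a proxy, I add that proxy to the disabled_proxies set so it is not used again: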

import asyncio
import aiohttp
from bs4 import BeautifulSoup

from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()

proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            available_proxies = [p for p in proxies if p not in disabled_proxies]

            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout = aiohttp.ClientTimeout(total=5))

            print("trying using:",proxy)

            async with proxy_session_map[proxy].get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                print("error, disabling:",proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''


async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]

    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())

Prints (for example):

trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrape instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?

... and so on.
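
As a side note, aiohttp takes the proxy per request (the proxy= argument of session.get), so a single ClientSession can also be shared across all proxies instead of keeping one session per proxy. Below is a minimal sketch of that variant, not a drop-in replacement. The names fetch_with_fallback and make_fetch and the (url, proxy) callable signature are made up for illustration. The retry policy is an assumption modelled on the script above, except that proxies are tried in list order rather than picked at random.

```python
import asyncio


async def fetch_with_fallback(fetch, url, proxies, disabled):
    """Try each non-disabled proxy in list order and return the first result.

    fetch is any coroutine function taking (url, proxy) and returning text
    (a hypothetical signature chosen for this sketch). Proxies that raise
    are added to the shared `disabled` set so later calls skip them.
    """
    for proxy in proxies:
        if proxy in disabled:
            continue
        try:
            return await fetch(url, proxy)
        except Exception:
            print("error, disabling:", proxy)
            disabled.add(proxy)
    # No working proxy left: return an empty string, as the script above does.
    return ''


def make_fetch(session):
    """Wrap one shared aiohttp.ClientSession; the proxy is chosen per request."""
    async def fetch(url, proxy):
        # ssl=False from the script above is left out here, so TLS
        # certificates are verified as usual.
        async with session.get(url, proxy=proxy) as resp:
            resp.raise_for_status()
            return await resp.text()
    return fetch


async def main(urls, proxies):
    import aiohttp  # imported here so the helpers above work without it

    disabled = set()
    timeout = aiohttp.ClientTimeout(total=5)
    # "async with" closes the session even if one of the tasks raises.
    async with aiohttp.ClientSession(timeout=timeout) as session:
        fetch = make_fetch(session)
        return await asyncio.gather(
            *(fetch_with_fallback(fetch, url, proxies, disabled) for url in urls)
        )
```

It would be run as asyncio.run(main(links, proxies)), which returns one page text (or an empty string) per link. Passing the fetch coroutine in as an argument keeps the fallback logic independent of aiohttp, so it can be exercised with a fake fetch function and no network access.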

python

web-scraping

python-3.x

python-asyncio

aiohttp