python AsyncHTMLSession: You don't have permission to access "XXX" on this server

I'm trying to access a website from Python using AsyncHTMLSession from the requests_html library.
Here is my code:

from requests_html import AsyncHTMLSession
import asyncio

async def connect_to_site(url):
    session = AsyncHTMLSession()
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
    res = await session.get(url, headers=headers)
    print(res)
    await res.html.arender(sleep=5, timeout=30)
    print(res.html.full_text)

url = 'https://www.otcmarkets.com'

asyncio.run(connect_to_site(url))

Running the code prints the following:

<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://www.otcmarkets.com/" on this server.
Reference #18.9c4519d4.1643149046.338b64e3

What could be the problem, and how can I get around it?

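I think this is some kind of bot detection. requests_html can render JavaScript, but it is not a real browser, so it cannot fully get past bot protection. Note that the server still answers with status 200, so the block only shows up in the page body.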
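Because the status code is 200 even when the request is blocked, checking the status alone will not catch this. As a rough sketch, one could look for the block page's text in the body. The helper name looks_blocked is hypothetical, and the marker strings are simply copied from the output in the question; other sites or protection services will likely use different wording.

```python
def looks_blocked(page_text: str) -> bool:
    """Guess whether page_text is the 'Access Denied' block page.

    The markers are an assumption taken from the output shown in the
    question; they are not a general-purpose bot-block detector.
    """
    markers = (
        "Access Denied",
        "You don't have permission to access",
    )
    return any(marker in page_text for marker in markers)
```

In the question's code this could be called as looks_blocked(res.html.full_text) after rendering, to decide whether to fall back to a real browser.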

You can use a library that drives a real browser instead, such as playwright, selenium, or puppeteer.

Here is an example using playwright:

from playwright.sync_api import sync_playwright

URL = 'https://www.otcmarkets.com'

with sync_playwright() as p:
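    # WebKit is, in my experience, the fastest to start and the hardest to detect
    browser = p.webkit.launch(headless=True)

    page = browser.new_page()
    page.goto(URL)

    # Full HTML of the page as the browser currently sees it
    html = page.content()

    browser.close()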

print(html)
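Since the code in the question is built on asyncio, the same idea can also be written with playwright's async API. This is only a sketch under the same assumptions as the example above (WebKit, headless mode); the function name fetch_html is made up for illustration, and there is no guarantee that any particular site will let the request through.

```python
import asyncio


async def fetch_html(url: str) -> str:
    """Load url in a headless WebKit browser and return the page HTML."""
    # Imported inside the function so that importing this module does not
    # require playwright; it is only needed when fetch_html() is called.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.webkit.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            # Close the browser even if navigation raises.
            await browser.close()
```

It would be called the same way as the original coroutine, for example html = asyncio.run(fetch_html('https://www.otcmarkets.com')). The browser binaries have to be installed first with the playwright install command.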