python AsyncHTMLSession: You don't have permission to access "XXX" on this server

Question

I want to access a site from Python using AsyncHTMLSession from the requests_html library.
Here is my code:

from requests_html import AsyncHTMLSession
import asyncio

async def connect_to_site(url):
    session = AsyncHTMLSession()
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
    res = await session.get(url, headers=headers)
    print(res)
    await res.html.arender(sleep=5, timeout=30)
    print(res.html.full_text)

url = 'https://www.otcmarkets.com'

asyncio.run(connect_to_site(url))

Running the code prints the following:

<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://www.otcmarkets.com/" on this server.
Reference #18.9c4519d4.1643149046.338b64e3

What could be the problem, and how can I get around it?

Answer 1

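I think this is some kind of bot detection. requests_html can render JavaScript, but it is not a real browser, so it cannot fully bypass bot protection. Note that the response status is still 200 even though the body is the "Access Denied" page.

You can use a library that drives a real browser instead, such as Playwright, Selenium, or Puppeteer.

Here is an example with Playwright: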

from playwright.sync_api import sync_playwright

URL = 'https://www.otcmarkets.com'

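with sync_playwright() as p:
    # WebKit is fastest to start and hardest to detect
    browser = p.webkit.launch(headless=True)

    page = browser.new_page()
    page.goto(URL)

    # Rendered HTML of the page after navigation
    html = page.content()

    # Close the browser explicitly before leaving the context
    browser.close()

print(html)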
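
Since the code in the question is asynchronous, the same idea can be sketched with Playwright's async API. This is only a rough sketch, not a guaranteed way past the protection: the names `BLOCK_MARKERS`, `looks_blocked`, `fetch_html`, and `main` are made up for illustration, and the marker strings are simply copied from the output in the question. It assumes Playwright and its WebKit build are installed (`pip install playwright`, then `playwright install webkit`).

```python
import asyncio

# Marker strings copied from the block page shown in the question (an assumption:
# other sites or other protection vendors may use different wording).
BLOCK_MARKERS = ("Access Denied", "You don't have permission to access")


def looks_blocked(text: str) -> bool:
    """Return True if the text contains one of the block-page markers."""
    return any(marker in text for marker in BLOCK_MARKERS)


async def fetch_html(url: str) -> str:
    """Load url in a WebKit browser via Playwright and return the page HTML."""
    # Imported here so looks_blocked() stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.webkit.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            # Always release the browser, even if navigation fails.
            await browser.close()


async def main(url: str) -> None:
    html = await fetch_html(url)
    if looks_blocked(html):
        # A 200 status alone does not prove success, so inspect the body too.
        print("Still blocked; the site may also detect this browser.")
    else:
        print(html)


# Uncomment to run (needs network access and an installed WebKit build):
# asyncio.run(main("https://www.otcmarkets.com"))
```

The body check is there because the request in the question returned `<Response [200]>` together with the "Access Denied" page, so the status code alone would not reveal the block.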
