python AsyncHTMLSession: You don't have permission to access "XXX" on this server
I want to use AsyncHTMLSession from the requests_html library to access a site from Python.
This is my code:
from requests_html import AsyncHTMLSession
import asyncio

async def connect_to_site(url):
    session = AsyncHTMLSession()
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
    res = await session.get(url, headers=headers)
    print(res)
    await res.html.arender(sleep=5, timeout=30)
    print(res.html.full_text)

url = 'https://www.otcmarkets.com'
asyncio.run(connect_to_site(url))
After running the code, it prints the following:
<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://www.otcmarkets.com/" on this server.
Reference #18.9c4519d4.1643149046.338b64e3
What could be the problem, and how can I overcome it?
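I think this is some kind of bot detection. requests_html can render JavaScript, but it is not a real browser and cannot fully get past bot protection.
You can use a library that drives a real browser, such as playwright, selenium, or puppeteer.
Here is a playwright example: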
from playwright.sync_api import sync_playwright

URL = 'https://www.otcmarkets.com'

with sync_playwright() as p:
    # Webkit is fastest to start and hardest to detect
    browser = p.webkit.launch(headless=True)
    page = browser.new_page()
    page.goto(URL)
    html = page.content()
    print(html)
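
Since the question uses asyncio, the same idea can be sketched with Playwright's async API. This is only a minimal sketch: the helper names fetch_html, looks_blocked, and main are hypothetical, and the marker strings are simply copied from the block page printed in the question. Because that block page came back with status 200, the sketch checks the page body rather than the status code. Whether WebKit actually gets past this site's protection is not guaranteed.

```python
import asyncio

# Heuristic markers copied from the block page shown in the question (an assumption,
# not an official signature of the site's bot protection).
BLOCK_MARKERS = ("Access Denied", "You don't have permission to access")


def looks_blocked(html: str) -> bool:
    """Return True if the page body resembles the 'Access Denied' block page."""
    return any(marker in html for marker in BLOCK_MARKERS)


async def fetch_html(url: str) -> str:
    """Load url in headless WebKit via Playwright and return the rendered HTML."""
    # Imported lazily so looks_blocked() stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.webkit.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            # Always release the browser, even if navigation fails.
            await browser.close()


async def main(url: str) -> None:
    html = await fetch_html(url)
    if looks_blocked(html):
        print("Still blocked: the site served its Access Denied page.")
    else:
        print(html)


# Uncomment to run (requires `pip install playwright` and `playwright install webkit`):
# asyncio.run(main("https://www.otcmarkets.com"))
```

The body check is deliberately crude: a status of 200 alone says nothing here, so the text of the page is the only signal available.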