Python parsing the site gives <html></html>

Question

I need to parse a website, but when I request it, the response I get is just <html></html>.

I have tried changing the User-Agent and the cookies, but it did not help.

from bs4 import BeautifulSoup
import httpx

response = httpx.get('https://lolz.guru/market/')
soup = BeautifulSoup(response.text, 'lxml')

print(response.text)

Answer 1

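If the site requires a real browser, you can have a real browser retrieve the page and its data. Selenium is a tool designed for testing web applications, but in essence it runs scripts that mimic a user's interaction with a web browser in order to exercise the application.

There are good tutorials available, including ones on using Selenium from Python.

It also supports cookies: https://www.selenium.dev/documentation/webdriver/browser/cookies/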

from selenium import webdriver

driver = webdriver.Chrome()

driver.get("http://www.example.com")

# Adds the cookie into current browser context
driver.add_cookie({"name": "key", "value": "value"})
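To get from the snippet above to HTML you can actually parse, one option is to read driver.page_source after the page has loaded. The following is only a minimal sketch: fetch_rendered_html and its settle_seconds parameter are hypothetical names introduced here, and the fixed one-second pause is an assumption rather than something the site is known to need.

```python
import time


def fetch_rendered_html(driver, url, cookies=None, settle_seconds=1.0):
    """Load url in a real browser and return the HTML after scripts have run.

    driver is any Selenium WebDriver instance, e.g. webdriver.Chrome().
    cookies is an optional list of dicts with "name" and "value" keys.
    """
    driver.get(url)
    if cookies:
        # Selenium only accepts cookies for the domain that is currently
        # loaded, so add them after the first get() and then reload.
        for cookie in cookies:
            driver.add_cookie(cookie)
        driver.refresh()
    # Crude fixed pause so client-side scripts can fill in the page
    # (assumption: an explicit wait on a known element is more robust).
    time.sleep(settle_seconds)
    return driver.page_source


# Usage (requires the selenium package and a local Chrome install):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       html = fetch_rendered_html(driver, "https://lolz.guru/market/")
#   finally:
#       driver.quit()
```

The returned string can be passed to BeautifulSoup(html, 'lxml') exactly as response.text was in the question.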

Answer 2

You can also use requests-html, which is able to render JavaScript:

from bs4 import BeautifulSoup
from requests_html import HTMLSession


session = HTMLSession()
resp = session.get('https://lolz.guru/market/')

resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")

# Print the text content of the rendered page
print(soup.text)

You can install it with pip: pip install requests-html
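Both answers assume the page content is produced by JavaScript. A quick way to check that before installing a browser or a renderer is to test whether the plain HTTP response contains any visible text at all. This is a rough sketch using only the standard library; looks_like_empty_shell and min_chars are hypothetical names, and "no visible text" is a heuristic, not proof that rendering is required.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html, min_chars=1):
    """Return True if the document has fewer than min_chars of visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks) < min_chars
```

With the code from the question, looks_like_empty_shell(response.text) returning True suggests the server sent only a shell (or a script-based check), in which case Selenium or requests-html is worth trying.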

