How to scrape data from a website that uses just one URL

Question

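I'm a student, and I'm trying to scrape data from our online registration system so that a Discord bot can post the information on Discord. The site requires a login, which I can do with this code: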

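import requests
from lxml import html

login_url = "url"  # placeholder for the site's login page

# A Session keeps cookies, so the login persists across later requests.
session_requests = requests.Session()
result = session_requests.get(login_url)

# Parsed login page; useful if the form has hidden fields that must be sent back.
tree = html.fromstring(result.text)

payload = {
    "txtUser": "user",
    "txtPassword": "pass",
}

result = session_requests.post(
    login_url,
    data=payload,
    headers={"Referer": login_url},
)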

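But once I started scraping, I ran into a problem: the site uses only one URL. To explain it better, imagine you needed to scrape data from stackoverflow.com, but the URL in the address bar always stayed stackoverflow.com/ even when you visited other pages of the site, such as the Ask Question or bountied questions pages.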

I don't know how to retrieve data from something like that.

Answer 1

If the site behaves like a single-page application, I can think of two ways to approach the problem:

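Option 1: Try to reverse-engineer the API calls the web page makes. Open the site in Chrome, open Developer Tools (CTRL+SHIFT+I), and watch the Network tab as you click around the site. It should show you every request the page makes to its server. Depending on how complex the site is, this may be easy to follow or completely incomprehensible. Maybe you'll find API endpoints like www.school.edu/classinfo/1234 that you can call directly to get the data. Use a tool like Postman to see whether you can recreate some of the API calls. If you don't have any good leads within a few minutes, move on to Option 2.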
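As a rough illustration of Option 1, here is a minimal sketch of calling a discovered endpoint with the session that is already logged in. The helper name `fetch_json`, the assumption that the endpoint returns JSON, and the `X-Requested-With` header are all guesses for illustration; the real URL, parameters, and headers have to be copied from what the Network tab shows.

```python
def fetch_json(session, url, params=None, timeout=10):
    """GET an endpoint with an already logged-in session and return the parsed JSON.

    `session` is expected to behave like requests.Session (for example the
    `session_requests` object from the question's login code).
    """
    response = session.get(
        url,
        params=params,
        headers={
            # Assumption: many single-page apps send these on background calls.
            # Copy the real headers from the Network tab if the site needs others.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
        timeout=timeout,
    )
    # Fail loudly on 401/403/500 rather than trying to parse an error page.
    response.raise_for_status()
    return response.json()
```

With the login code from the question, a call might look like `fetch_json(session_requests, "https://www.school.edu/classinfo/1234")`, where the URL is only the hypothetical endpoint mentioned above.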

Option 2: Look into a browser automation tool like Selenium. The most common use for Selenium is automated testing of websites, but you can also use it with Python to perform actions on a web page and then interrogate the resulting document state (e.g., open this site; find the text field with the id "studentid"; type my student id into that field; find the button with the id "viewschedule"; click it; find the div with the id "schedule"; return the text from inside that div). Some good places to start are the selenium-python docs and a pretty good "getting started" blog post. You can bail out once they start talking about testing frameworks.
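The steps in parentheses above could be sketched roughly like this with Selenium's Python bindings. The element ids "studentid", "viewschedule", and "schedule" are the made-up ones from the example, and the function names `read_schedule` and `make_driver` are hypothetical; a real page needs its own ids, and probably a login step first.

```python
def read_schedule(driver, url, student_id):
    """Drive an existing Selenium WebDriver through the steps described above."""
    driver.get(url)
    # "id" is the locator strategy string that selenium's By.ID constant holds.
    driver.find_element("id", "studentid").send_keys(student_id)
    driver.find_element("id", "viewschedule").click()
    # Return the visible text of the element the page fills in after the click.
    return driver.find_element("id", "schedule").text


def make_driver(wait_seconds=10):
    """Create a Chrome driver; needs the selenium package and a local Chrome."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    # Keep retrying element lookups for up to wait_seconds, since a
    # single-page app inserts content with JavaScript after the page loads.
    driver.implicitly_wait(wait_seconds)
    return driver
```

A typical run would create the browser with `driver = make_driver()`, call `read_schedule(driver, login_url, "my student id")`, and finish with `driver.quit()` so the browser window is closed.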
