Scraping multiple pages of a website that uses __doPostBack for paging, where the URL does not change between pages

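I am using BeautifulSoup to scrape https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019
There are two pages of information in total. To move between them, there are links such as 1, 2 at both the top and the bottom of the page. These links use __doPostBack: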

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page')"

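The problem is that the URL does not change when navigating from one page to another; only the second argument of __doPostBack changes, i.e. it is Page$1 for page 1 and Page$2 for page 2. How can I use BeautifulSoup to iterate over the pages and extract the information? The form data is as follows: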

ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page

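The form data also contains a variable called __VIEWSTATE, but its content is very large. I have looked at several solutions and posts that suggest inspecting the POST call and reusing its parameters, but I cannot make sense of the parameters sent in the POST.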
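To make the POST parameters less opaque: __doPostBack fills the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form, and the server expects the page's other hidden fields (such as __VIEWSTATE) to be sent back unchanged. The following is only a minimal sketch of that idea using the standard library; the sample HTML, the FAKE_* values, and the helper names HiddenInputCollector and build_postback_payload are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            # A missing value attribute is treated as an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_payload(html, event_target, event_argument):
    """Echo back the page's hidden fields and set the two postback fields."""
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Made-up stand-in for the real page; a real __VIEWSTATE is far longer.
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="FAKE_VIEWSTATE">
  <input type="hidden" name="__EVENTVALIDATION" value="FAKE_VALIDATION">
  <input type="text" name="search" value="not a hidden field">
</form>
"""

payload = build_postback_payload(
    sample_html, "ctl00$ContentPlaceHolder1$GridView2", "Page$2"
)
for name, value in payload.items():
    print(name, "=", value)
```

The size of __VIEWSTATE does not matter here: it is never typed out by hand, only copied from the page that was just downloaded into the next request.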

You can use this example of how to load the next page of this website using requests:

import requests
from bs4 import BeautifulSoup


url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


def load_page(soup, page_num):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }

    payload = {
        "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTARGUMENT": "Page${}".format(page_num),
        "__LASTFOCUS": "",
        "__ASYNCPOST": "true",
    }

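    # copy every named <input> (including __VIEWSTATE) from the current page
    for inp in soup.select("input[name]"):
        payload[inp["name"]] = inp.get("value")

    # district and hospital-type filters, as seen in the browser's form data
    payload["ctl00$ContentPlaceHolder1$ddl_District"] = "019"
    payload["ctl00$ContentPlaceHolder1$rdo_Govt_Flag"] = "G"
    # leave the checkbox out of the request (an unchecked box is not submitted)
    payload.pop("ctl00$ContentPlaceHolder1$chk_Available", None)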

    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    soup = BeautifulSoup(
        requests.post(api_url, data=payload, headers=headers).content,
        "html.parser",
    )
    return soup


# print hospitals from first page:
for h5 in soup.select("h5"):
    print(h5.text)

# load second page
soup = load_page(soup, 2)

# print hospitals from second page
for h5 in soup.select("h5"):
    print(h5.text)

Prints:

 AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
 Calcutta National Medical College and Hospital (Government Hospital)
 CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
 College of Medicine  Sagore Dutta Hospital (Government Hospital)
 ESI Hospital Maniktala (Government Hospital)
 ESI Hospital Sealdah (Government Hospital)
 I.D. And B.G. Hospital (Government Hospital)
 M R Bangur Hospital (Government Hospital)
 Medical College and Hospital, Kolkata, (Government Hospital)
 Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
 R. G. Kar Medical College and Hospital  (Government Hospital)
 Sambhunath Pandit Hospital (Government Hospital)
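
If the number of pages is not known in advance, the same idea can be wrapped in a loop. The sketch below is not part of the code above: iter_pages and fetch_page are hypothetical names, and the stop condition (an empty page, or a page identical to the previous one) is an assumption about how a server might answer a request past the last page, which is why there is also a max_pages cap. It is demonstrated with fake data so it runs without network access.

```python
def iter_pages(fetch_page, max_pages=50):
    """Yield (page_num, items) until a page is empty or repeats the previous one.

    fetch_page is any callable that takes a 1-based page number and returns
    a list of items scraped from that page.
    """
    previous = None
    for page_num in range(1, max_pages + 1):
        items = fetch_page(page_num)
        if not items or items == previous:
            break
        yield page_num, items
        previous = items


# Fake two-page site, used only to demonstrate the loop.
fake_site = {1: ["Hospital A", "Hospital B"], 2: ["Hospital C"]}


def fake_fetch(page_num):
    return fake_site.get(page_num, [])


pages = list(iter_pages(fake_fetch))
for page_num, items in pages:
    print(page_num, items)
```

With the code above, fetch_page could be a small wrapper that returns the h5 texts from the initial GET response for page 1 and from load_page(first_soup, n) for later pages, where first_soup is the soup of the initial GET kept in its own variable.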