HTML parsed with Python differs from the actual web page

Question

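I need to fetch and store the PM2.5 and PM10 values from the table at https://app.cpcbccr.com/AQI_India/. I am scraping the page with BeautifulSoup4, but the HTML I parse is different from the actual page: the markup I get back is not the markup I see in the browser.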

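I have already written the code that extracts the table rows and table data, but because the HTML I parse is missing the rows of the table body, that code finds nothing. So for now all I have is this, which dumps the parsed HTML so I can inspect it: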

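from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://app.cpcbccr.com/AQI_India/"
# requests returns only the HTML the server sends for this URL
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Dump the parsed HTML for inspection; "w" overwrites the file on each run
with open("Desktop/soup.html", "w", encoding="utf-8") as dumpfile:
    dumpfile.write(str(soup))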

How can I get the complete tables? Thanks in advance.

Answer 1

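Try the code below. I implemented a scraping script for https://app.cpcbccr.com/AQI_India/ that goes through the site's API instead of the HTML. With requests you can call the API directly; it sends back a result that you then parse as JSON.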

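import requests
import urllib3
from urllib3.exceptions import InsecureRequestWarning

# Certificate verification is turned off below, so silence the resulting warning
urllib3.disable_warnings(InsecureRequestWarning)

API_URL = 'https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters'

def scrape_air_quality_index():
    # Mandatory request body, copied from the browser's network tab
    payload = 'eyJzdGF0aW9uX2lkIjoic2l0ZV8zMDEiLCJkYXRlIjoiMjAyMC0wNy0yNFQ5OjAwOjAwWiJ9:'

    session = requests.Session()
    response = session.post(API_URL, data=payload, verify=False)
    response.raise_for_status()  # stop early on an HTTP error
    result = response.json()     # parse the JSON response body
    extracted_metrics = result['metrics']
    print(extracted_metrics)
    return extracted_metrics

scrape_air_quality_index()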

I inspected the API calls in the browser's network tab and found the API URL there: https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters. I use that URL to fetch the data, together with one mandatory parameter, the payload; without it you will not get any data back. You can build on this script and add code that saves the data (see the screenshot) to a .csv or Excel file.
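The payload string looks opaque, but apart from its trailing colon it is plain Base64, and decoding it gives a small JSON object with a `station_id` and a `date`. The sketch below decodes it and rebuilds it from those two fields. The helper names `decode_payload` and `build_payload` are made up for illustration, and whether the API accepts other station IDs or dates built this way is an assumption that has not been checked against the server.

```python
import base64
import json

# The payload exactly as it appears in the script above
ORIGINAL_PAYLOAD = 'eyJzdGF0aW9uX2lkIjoic2l0ZV8zMDEiLCJkYXRlIjoiMjAyMC0wNy0yNFQ5OjAwOjAwWiJ9:'

def decode_payload(payload):
    """Strip the trailing colon, Base64-decode, and parse the JSON inside."""
    raw = base64.b64decode(payload.rstrip(':'))
    return json.loads(raw)

def build_payload(station_id, date):
    """Hypothetical helper: re-create a payload in the same shape.

    Assumption: the server expects compact JSON (no spaces), Base64-encoded,
    followed by a colon, as in the captured request.
    """
    body = json.dumps({"station_id": station_id, "date": date},
                      separators=(",", ":"))
    encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")
    return encoded + ":"

decoded = decode_payload(ORIGINAL_PAYLOAD)
print(decoded)
rebuilt = build_payload(decoded["station_id"], decoded["date"])
print(rebuilt == ORIGINAL_PAYLOAD)
```

If the round trip holds for your captured request as well, changing `station_id` or `date` before encoding is the natural thing to try next, with the station IDs taken from the requests you see in the network tab.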

Screenshot of the API URL
Screenshot of the JSON metrics result
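For the saving step mentioned above, here is a minimal sketch using only the standard library. It assumes that the value under `metrics` is a list of flat dictionaries, which this post does not confirm, so check the real response first. The function name `write_metrics_csv` is made up, and the sample rows are placeholders with invented field names and values, not real API output.

```python
import csv
import io

def write_metrics_csv(metrics, fileobj):
    """Write a list of flat dicts as CSV rows; return the number of rows."""
    # Collect column names in the order they first appear
    fieldnames = []
    for row in metrics:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(metrics)  # keys missing from a row become empty cells
    return len(metrics)

# Placeholder data: the field names and numbers are hypothetical
sample_metrics = [
    {"name": "PM2.5", "avg": 42},
    {"name": "PM10", "avg": 87, "max": 120},
]

buffer = io.StringIO()
rows_written = write_metrics_csv(sample_metrics, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, pass a file opened with `open("metrics.csv", "w", newline="", encoding="utf-8")`. If the real metrics turn out to be nested, they would need flattening before this step.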

python-3.x

web-scraping

beautifulsoup

html-parsing