Not getting the number when scraping the count of new COVID cases with BeautifulSoup

Good evening,

I am currently trying to scrape the number of new COVID cases in South Australia from the website (https://www.covid-19.sa.gov.au/home/dashboard).

I found that the number sits under the following markup:

<div id="convid19-data-visual" class="twbs">
<div class="container">
    <div class="row southaus">
        <div class="col-md-6 col-lg-4" style="padding:10px 25px">
            <div class="st">
                "New Cases"
                <span class="nCasesa majorNum">64</span>
            </div>
        </div>
    </div>
</div>

So I tried to scrape the number with the following code:

import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.covid-19.sa.gov.au/home/dashboard")
soup = BeautifulSoup(result.text, "html.parser")
cases = soup.find("div", {"class": "st"})
st = cases.find_all("span")
print(st)

I got this result:
[<span class="nCasesa majorNum"> </span>]

which does not include the case number.

I also tried using Selenium, but I could not get the case number that way either. I am now unsure whether the HTML tag I found is the right one.

If possible, can this be solved by targeting the correct HTML tag?

Thanks!

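The text of the span element with class nCasesa is loaded dynamically (by JavaScript), so there is a delay before the actual value is rendered in your browser. The static HTML that requests downloads only contains the empty placeholder, which is why BeautifulSoup finds no number. What you need to do (using Selenium) is detect the change in the text. You can do it like this: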

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

CLASS = 'nCasesa'
options = webdriver.ChromeOptions()
options.add_argument('--headless')

class detect():
    def __init__(self, locator, params):
        self.locator = locator
        self.params = params
        self.text = None

    def gettext(self, driver):
        return driver.find_element(self.locator, self.params).text

    def __call__(self, driver):
        if self.text is None:
            self.text = self.gettext(driver)
        else:
            current = self.gettext(driver)
            if current != self.text:
                self.text = current
                return True
        return False

with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.covid-19.sa.gov.au/home/dashboard')
    detector = detect(By.CLASS_NAME, CLASS)
    WebDriverWait(driver, 10).until(detector)
    print(detector.text)

Output:

73
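
The detector above only fires when the text changes between two polls. If the value happens to be rendered before the first poll, no change is ever seen and the wait times out. A possible alternative, sketched below, is to wait until the text is simply non-blank. This assumes the placeholder is empty or whitespace before the value loads, as the output in the question suggests. The helper names nonempty_text and fetch_new_cases are made up for this sketch and are not part of Selenium.

```python
def nonempty_text(by, value):
    """Build a wait condition that yields the element's text once it is non-blank."""
    def condition(driver):
        text = driver.find_element(by, value).text.strip()
        # WebDriverWait.until() keeps polling while the condition returns a falsy value.
        return text or False
    return condition


def fetch_new_cases(url, timeout=10):
    """Open the page headlessly and wait for the span text (needs Selenium and Chrome)."""
    # Imported here so the condition above can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    with webdriver.Chrome(options=options) as driver:
        driver.get(url)
        # until() returns whatever truthy value the condition produced.
        return WebDriverWait(driver, timeout).until(
            nonempty_text(By.CLASS_NAME, 'nCasesa'))
```

Calling fetch_new_cases('https://www.covid-19.sa.gov.au/home/dashboard') should then return the rendered number as a string, or raise a TimeoutException if nothing appears within the timeout.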

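The data comes dynamically from a daily covid_19 data file. You can pick up the URL of that file dynamically from one of the COVID-related js source files, then request the JSON data from the daily file:
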
import requests, re

with requests.Session() as s:
    r = s.get('https://www.covid-19.sa.gov.au/configuration/container-templates/data-visualisation/chartdata.js')
    data_url = re.search(r'var dataSource = "(.*?)"', r.text).group(1)
    data = s.get(data_url).json()
    
print(f"{data['hp_date']}: new cases = {data['newcase_sa']}")
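
The regex step can be tried offline. The sketch below runs the same pattern against a made-up stand-in for the script text (the sample string and its example.com URL are hypothetical, not the real contents of chartdata.js) and raises a clear error instead of an AttributeError when the pattern is not found, which would happen if the site changes its script.

```python
import re

# Same pattern as in the answer: capture the quoted value assigned to dataSource.
DATA_SOURCE_RE = re.compile(r'var dataSource = "(.*?)"')


def extract_data_url(js_text):
    """Return the URL assigned to dataSource in the given script text."""
    match = DATA_SOURCE_RE.search(js_text)
    if match is None:
        # re.search returns None on no match; fail with a readable message.
        raise ValueError("dataSource assignment not found; the script may have changed")
    return match.group(1)


# Hypothetical stand-in for the script contents, for illustration only.
sample_js = 'var dataSource = "https://example.com/daily.json";\nvar other = 1;'
print(extract_data_url(sample_js))
```

In the real flow, extract_data_url(r.text) would take the place of the re.search(...).group(1) line.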