Not getting the number when scraping the count of new COVID cases with BeautifulSoup
Good evening,
I am currently trying to scrape the number of COVID cases in South Australia from the website (https://www.covid-19.sa.gov.au/home/dashboard).
I found that the number sits under:
<div id="convid19-data-visual" class="twbs">
  <div class="container">
    <div class="row southaus">
      <div clsass="col-md-6 col-lg-4" style="padding:10px 25px">
        <div class="st">
          "New Cases"
          <span class="nCasesa majorNum">64</span>
        </div>
      </div>
    </div>
  </div>
So I tried to scrape the number with the following code:
import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.covid-19.sa.gov.au/home/dashboard")
soup = BeautifulSoup(result.text, "html.parser")
cases = soup.find("div", {"class": "st"})
st = cases.find_all("span")
print(st)
The result I got was
[<span class="nCasesa majorNum"> </span>]
which does not include the case number.
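I have also tried Selenium, but I could not get the case number that way either. I am now unsure whether the HTML tag I found is the correct one.
Can this be solved by targeting the correct HTML tag?
Thanks!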
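The text of the span element with class nCasesa is loaded dynamically (by JavaScript), so there is a delay before the actual value is rendered in your browser. What you need to do (using Selenium) is detect the change in the text. You could do it like this: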
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

CLASS = 'nCasesa'

options = webdriver.ChromeOptions()
options.add_argument('--headless')

class detect():
    # Wait condition: true once the element's text differs from the first value seen
    def __init__(self, locator, params):
        self.locator = locator
        self.params = params
        self.text = None

    def gettext(self, driver):
        return driver.find_element(self.locator, self.params).text

    def __call__(self, driver):
        if self.text is None:
            # First poll: remember the initial text
            self.text = self.gettext(driver)
        else:
            current = self.gettext(driver)
            if current != self.text:
                self.text = current
                return True
        return False

with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.covid-19.sa.gov.au/home/dashboard')
    detector = detect(By.CLASS_NAME, CLASS)
    # Poll for up to 10 seconds until the text changes
    WebDriverWait(driver, 10).until(detector)
    print(detector.text)
Output:
73
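One caveat with waiting for a change: if the value has already been filled in by the time the first check runs, the text never changes again and the wait times out. A possible alternative is to wait until the span's text is non-blank. The sketch below assumes the span stays blank until the script fills it, as the requests output in the question suggests. `non_empty_text`, `FakeElement`, `FakeDriver` and `poll` are hypothetical names introduced here; the fakes exist only so the example runs without a browser.

```python
import time

def non_empty_text(by, value):
    """Build a wait condition that returns the element's text once it is non-blank."""
    def condition(driver):
        text = driver.find_element(by, value).text.strip()
        return text or False  # a falsy result tells the waiter to poll again
    return condition

# Stand-ins so the sketch runs without a browser (hypothetical, not Selenium classes).
class FakeElement:
    def __init__(self, text):
        self.text = text

class FakeDriver:
    """Hands out the given texts in order, repeating the last one forever."""
    def __init__(self, texts):
        self._texts = list(texts)

    def find_element(self, by, value):
        text = self._texts.pop(0) if len(self._texts) > 1 else self._texts[0]
        return FakeElement(text)

def poll(condition, driver, timeout=2.0, interval=0.01):
    """Tiny imitation of WebDriverWait(driver, timeout).until(condition)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition(driver)
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met before the timeout")

# The span is blank for the first two lookups, then holds the value.
driver = FakeDriver([" ", " ", "73"])
result = poll(non_empty_text("class name", "nCasesa"), driver)
print(result)
```

With real Selenium the same condition would be passed as `WebDriverWait(driver, 10).until(non_empty_text(By.CLASS_NAME, CLASS))`, which returns the text as soon as the condition is truthy.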
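The data comes dynamically from a daily covid_19 data file. You can get the URL of this file dynamically from one of the COVID-related js source files, and then request the JSON data from the daily file: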
import requests, re

with requests.Session() as s:
    # The script that drives the dashboard names the daily data file
    r = s.get('https://www.covid-19.sa.gov.au/configuration/container-templates/data-visualisation/chartdata.js')
    data_url = re.search(r'var dataSource = "(.*?)"', r.text).group(1)
    # Fetch the daily file and parse it as JSON
    data = s.get(data_url).json()
    print(f"{data['hp_date']}: new cases = {data['newcase_sa']}")
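The regular expression step is the fragile part: if the script is ever reorganised, `re.search` returns `None` and `.group(1)` raises an `AttributeError`. Below is a minimal sketch of a more defensive version. `extract_data_url` is a hypothetical helper, and the sample script text and JSON values are invented placeholders, not real data from the site.

```python
import json
import re

def extract_data_url(js_text):
    """Pull the URL assigned to `dataSource` out of the script text."""
    match = re.search(r'var dataSource = "(.*?)"', js_text)
    if match is None:
        raise ValueError("no dataSource assignment found in the script")
    return match.group(1)

# Placeholder inputs for illustration only; not taken from the real site.
sample_js = 'var chart = null;\nvar dataSource = "https://example.invalid/daily.json";\n'
sample_json = '{"hp_date": "1 January", "newcase_sa": "0"}'

data_url = extract_data_url(sample_js)
data = json.loads(sample_json)
summary = f"{data['hp_date']}: new cases = {data['newcase_sa']}"
print(data_url)
print(summary)
```

In the real script you would pass `r.text` to `extract_data_url` and keep using `s.get(data_url).json()` for the download.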