具有重复 class 名称的嵌套 div 元素的 Web 抓取
Web Scraping of nested div elements with repeating class names
<div class="information_row" id="dashboard">
<a href="#statewise-data" class="state_link" title="Statewise">Statewise</a>
<div class="info_title1">Cases Across India</div>
<div class="active-case">
<div class="block-active-cases">
<span class="icount">3,86,351</span>
<div class="increase_block">
<div class="color-green down-arrow">
2,157 <i></i>
</div>
</div>
</div>
<div class="info_label">Active Cases
<span class="per_block">(1.21%)</span>
</div>
</div>
<div class="iblock discharge">
<div class="iblock_text">
<div class="info_label"> Discharged
<div class="per_block">
(97.45%)
</div>
</div>
<span class="icount">3,12,20,981</span>
<div class="increase_block">
<div class="color-green up-arrow">
40,013 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock death_case">
<div class="iblock_text">
<div class="info_label">Deaths
<div class="per_block">
(1.34%)
</div>
</div>
<span class="icount">4,29,179</span>
<div class="increase_block">
<div class="color-red up-arrow">
497 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock t_case">
<div class="iblock_text">
<div class="info_label">Total Cases
<div class="per_block"></div>
</div>
<span class="icount">3,20,36,511</span>
<div class="increase_block">
<div class="color-red up-arrow">
38,353 <i></i>
</div>
</div>
</div>
</div></div>
我正在使用 python 和 beautifulsoup 开发网络 抓取 项目。作为初学者,我无法解析我需要的数据(covid 上的数值统计),因为包含数值数据的 class 名称是重复的,不像 icount
、per_block
那样唯一increase_block
。我想要的是只解析这些数字数据并将其存储在不同的变量中,如下所示-
- Total_cases = 3,20,36,511
- Total_cases_in_last_24_hrs = 38,353 所有其他类别(出院、死亡、活跃病例)也是如此
这是我的代码-
URL = 'https://www.mygov.in/covid-19/'
page = requests.get(URL,headers=headers)
clean_data=BeautifulSoup(page.text,'html.parser')
span=clean_data.findAll('span',class_='icount')
#print(clean_data)
total_cases = clean_data.find("div",class_="iblock
t_case",attrs={'spanclass':'icount'}).get_text()
print(total_cases)
我已经研究了很长时间,但找不到解决方案。请帮忙。
这是来自 Click here to visit the website.
的参考代码
谢谢。
一个可能的解决方案是 select 来自 class="t_case"
的所有文本并拆分文本:
import requests
from bs4 import BeautifulSoup
url = "https://www.mygov.in/covid-19/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
_, total_cases, new_cases = (
soup.select_one(".t_case").get_text(strip=True, separator="|").split("|")
)
print(total_cases)
print(new_cases)
打印:
3,20,36,511
38,353
或者:
t_case = soup.select_one(".t_case")
total_cases = t_case.select_one(".icount")
new_cases = t_case.select_one(".color-red, .color-green")
print(total_cases.get_text(strip=True))
print(new_cases.get_text(strip=True))
<div class="information_row" id="dashboard">
<a href="#statewise-data" class="state_link" title="Statewise">Statewise</a>
<div class="info_title1">Cases Across India</div>
<div class="active-case">
<div class="block-active-cases">
<span class="icount">3,86,351</span>
<div class="increase_block">
<div class="color-green down-arrow">
2,157 <i></i>
</div>
</div>
</div>
<div class="info_label">Active Cases
<span class="per_block">(1.21%)</span>
</div>
</div>
<div class="iblock discharge">
<div class="iblock_text">
<div class="info_label"> Discharged
<div class="per_block">
(97.45%)
</div>
</div>
<span class="icount">3,12,20,981</span>
<div class="increase_block">
<div class="color-green up-arrow">
40,013 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock death_case">
<div class="iblock_text">
<div class="info_label">Deaths
<div class="per_block">
(1.34%)
</div>
</div>
<span class="icount">4,29,179</span>
<div class="increase_block">
<div class="color-red up-arrow">
497 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock t_case">
<div class="iblock_text">
<div class="info_label">Total Cases
<div class="per_block"></div>
</div>
<span class="icount">3,20,36,511</span>
<div class="increase_block">
<div class="color-red up-arrow">
38,353 <i></i>
</div>
</div>
</div>
</div></div>
我正在使用 python 和 beautifulsoup 开发网络 抓取 项目。作为初学者,我无法解析我需要的数据(covid 上的数值统计),因为包含数值数据的 class 名称是重复的,不像 icount
、per_block
那样唯一increase_block
。我想要的是只解析这些数字数据并将其存储在不同的变量中,如下所示-
- Total_cases = 3,20,36,511
- Total_cases_in_last_24_hrs = 38,353 所有其他类别(出院、死亡、活跃病例)也是如此
这是我的代码-
URL = 'https://www.mygov.in/covid-19/'
page = requests.get(URL,headers=headers)
clean_data=BeautifulSoup(page.text,'html.parser')
span=clean_data.findAll('span',class_='icount')
#print(clean_data)
total_cases = clean_data.find("div",class_="iblock
t_case",attrs={'spanclass':'icount'}).get_text()
print(total_cases)
我已经研究了很长时间,但找不到解决方案。请帮忙。 这是来自 Click here to visit the website.
的参考代码谢谢。
一个可能的解决方案是 select 来自 class="t_case"
的所有文本并拆分文本:
import requests
from bs4 import BeautifulSoup
url = "https://www.mygov.in/covid-19/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
_, total_cases, new_cases = (
soup.select_one(".t_case").get_text(strip=True, separator="|").split("|")
)
print(total_cases)
print(new_cases)
打印:
3,20,36,511
38,353
或者:
t_case = soup.select_one(".t_case")
total_cases = t_case.select_one(".icount")
new_cases = t_case.select_one(".color-red, .color-green")
print(total_cases.get_text(strip=True))
print(new_cases.get_text(strip=True))