Cannot scrape elements with Playwright and BeautifulSoup
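I'm trying to update my web-scraping script now that the website (https://covid19.gov.vn/) has been updated, but I can't for the life of me figure out how to parse these elements. Inspecting the elements looks the same as usual, but I can't parse them with BeautifulSoup. My initial attempts included switching to Playwright and trying again, but I still couldn't scrape the page correctly. Looking at the page source, it's almost as if the elements aren't there at all. Could anyone who knows more about HTML and web scraping explain how this works? I'm pretty much stuck here.
This is basically my last attempt before I gave up looking at the page source: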
from bs4 import BeautifulSoup as bs
import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://covid19.gov.vn/")
    page_content = page.content()
    soup = bs(page_content, features="lxml")
    test = soup.findAll('div', class_ = "content-tab show", id="vi")
    print(test)
    browser.close()
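My idea was to scrape that element and iterate over everything inside it. However, it doesn't work. I'd really appreciate it if someone could help me out. Thanks!
Try the code below. It fetches the data you are looking for with a plain HTTP GET call.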
import requests

r = requests.get('https://static.pipezero.com/covid/data.json')
if r.status_code == 200:
    data = r.json()
    print(data['total']['internal'])
Output
{'death': 17545, 'treating': 27876, 'cases': 707436, 'recovered': 475343}
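The question mentions wanting to iterate over everything inside the element. Once the data arrives as JSON, that iteration can be done on the parsed object instead of on HTML. Here is a minimal sketch, assuming the payload is shaped like the output above. `SAMPLE` and the helper `flatten` are hypothetical names introduced for illustration, and the live file may well contain keys that are not shown here.

```python
import json

# Sample payload rebuilt from the output shown above. This shape is an
# assumption; the live JSON file may contain additional keys.
SAMPLE = (
    '{"total": {"internal": {"death": 17545, "treating": 27876, '
    '"cases": 707436, "recovered": 475343}}}'
)


def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Join nested keys with dots, e.g. "total.internal.death".
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


data = json.loads(SAMPLE)
rows = list(flatten(data))
for path, value in rows:
    print(f"{path}: {value}")
```

To run this against the live response, `json.loads(SAMPLE)` could be swapped for the `r.json()` result from the snippet above; the helper itself does not depend on any particular key being present.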