Web scraping using requests-html - How does one collect a simple number from a website?
I am trying to collect a data point from an electricity data website:
electricityMap | Live CO₂ emissions of electricity consumption
So far I have written this code:
from requests_html import HTMLSession #import libraries
s = HTMLSession()
url = 'https://app.electricitymap.org/zone/DK-DK2'
r = s.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'})
webpageTitle = (r.html.find('title', first=True).text)
print(webpageTitle)
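I can get VS Code to print the site's title, but I am only interested in the share of renewable energy at a given moment. It is shown as the "renewable" gauge in the top-left corner of the site.
I inspected the page and found the value I want to collect (screenshot of Chrome DevTools).
What do I need to write to print this value in Python?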
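As @Tim Roberts said, the site is built entirely with JavaScript. I tested both requests_html and selenium. With requests_html the result is empty, because the response fetched with s.get() contains none of the JavaScript-rendered content, whereas selenium drives a real browser and produces the expected output.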
from requests_html import HTMLSession #import libraries
from bs4 import BeautifulSoup as bs
s = HTMLSession()
url = 'https://app.electricitymap.org/zone/DK-DK2'
r = s.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'})
soup=bs(r.text,'html.parser')
renewable=[x.get_text() for x in soup.select('g[class="circular-gauge"] text')]
print(renewable)
Output:
[]
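# Selenium: webdriver_manager downloads a matching ChromeDriver, so no driver has to be installed by hand
# (the selenium, webdriver-manager and beautifulsoup4 packages themselves must be installed)
from bs4 import BeautifulSoup as bs
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://app.electricitymap.org/zone/DK-DK2'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(2)  # crude fixed wait so the JavaScript can render the gauges
soup=bs(driver.page_source,'html.parser')
# collect the gauge labels and take the second one, which is the renewable share
renewable=[x.get_text() for x in soup.select('g[class="circular-gauge"] text')][1]
print(renewable)
driver.quit()  # close the browser when done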
Output:
69%
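The value comes back as the string '69%', not as a number. If an actual number is needed, something along these lines should work. This is only a minimal sketch: parse_percent is a hypothetical helper name, and it assumes the gauge label is always digits followed by a percent sign.

```python
import re

def parse_percent(text):
    """Convert a gauge label such as '69%' to a float (69.0).

    Returns None when the text holds no percentage, for example
    a placeholder shown while the page is still loading.
    """
    # digits, an optional decimal part (dot or comma), then a percent sign
    match = re.search(r'(\d+(?:[.,]\d+)?)\s*%', text)
    if match is None:
        return None
    return float(match.group(1).replace(',', '.'))

print(parse_percent('69%'))  # 69.0
```

In the Selenium script above this would be called as parse_percent(renewable). Returning None instead of raising makes it easy to retry when the gauge has not rendered yet.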