Scraping data hidden behind a script tag
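Can anyone suggest a way to scrape data that sits behind a <script> tag? Specifically, in this case, I want the 30-minute table from AEMO (https://www.aemo.com.au/aemo/apps/visualisations/elec-nem-priceanddemand.html).
To get the data table, I need to click either the button that displays the table on the site or the download button. The obstacle is that when I try to scrape the page with Selenium, the button and the table text are hidden behind a <script> tag.
Here is my code so far: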
# import libraries
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
url = "https://www.aemo.com.au/aemo/apps/visualisations/elec-nem-priceanddemand.html"
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
browser.get(url)
try:
    print(browser.page_source)
except:
    print("not found")
finally:
    browser.quit()
Part of the result is:
<body aurelia-app="visualisation-main" data-gr-c-s-loaded="true">
<div class="splash">
<div class="message"><span class="icon-spinner"></span></div>
</div>
<script src="jspm_packages/system.js"></script>
<script src="config.js"></script>
<script>
System.import('aurelia-bootstrapper');
</script>
</body></html>
Selenium has its own ways of locating elements, such as find_element_by_css_selector (written as find_element(By.CSS_SELECTOR, ...) in Selenium 4). Browsers often need some time to render elements, so you may need to use WebDriverWait.
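To make the waiting idea concrete, here is a rough pure-Python sketch of the kind of polling loop an explicit wait performs. It is not Selenium's implementation; `wait_until` and `fake_lookup` are hypothetical names made up for illustration, and the "page" is simulated by a counter so the snippet runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose content only shows up on the third check.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return "rendered" if attempts["count"] >= 3 else None


value = wait_until(fake_lookup, timeout=5.0, poll_interval=0.01)
print(value, attempts["count"])
```

With Selenium you would not write this loop yourself; WebDriverWait(browser, 10).until(...) plays the same role, calling the condition with the driver until it succeeds or the timeout expires.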
Here is an example that extracts the spot price from the page:
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
url = 'https://www.aemo.com.au/aemo/apps/visualisations/elec-nem-priceanddemand.html'
browser = webdriver.Chrome()
browser.get(url)
sel = 'body > div > compose > div > compose.fill-height.flex-container.au-target > compose > div > div:nth-child(1) > div'
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, sel))
)
print(element.text)
Result:
.02/MWh
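For the 30-minute table itself, one possible follow-up is to parse the rendered HTML once the table is on screen. The sketch below assumes the table ends up as an ordinary <table> element in the DOM after the button is clicked, which I have not verified for this page. `TableRows` and `table_rows` are hypothetical helpers, and the sample markup is made up, not real AEMO data.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return the table cells found in html as a list of rows."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup standing in for the rendered page source.
sample = """
<table>
  <tr><th>Interval</th><th>Price</th></tr>
  <tr><td>00:30</td><td>10.5</td></tr>
  <tr><td>01:00</td><td>11.0</td></tr>
</table>
"""
rows = table_rows(sample)
print(rows)
```

In the Selenium script you would call table_rows(browser.page_source) after the wait has succeeded and before browser.quit(). If the page draws the table some other way (for example on a canvas), this approach will not find anything.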