Scraping data from a dynamic web database with Python
I'm new to Python and I'm looking into how to scrape data from this website:
https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for-a-specific-month
I'm not sure whether I should use Scrapy, BeautifulSoup, or Selenium. I need the data for a specific country (e.g. Germany, country code DE) for every month and day from 2012 to 2014.
Any help is greatly appreciated.
You can solve it with requests (for maintaining a web-scraping session), BeautifulSoup (for HTML parsing), a regex for extracting the value of the JavaScript variable containing the desired data inside a script tag, and ast.literal_eval() for turning that JS list into a Python list:
from ast import literal_eval
import re

from bs4 import BeautifulSoup
import requests

url = "https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for-a-specific-month"

payload = {
    'opt_period': '0',
    'opt_Country': '12',  # 12 stands for DE (Germany) here
    'opt_Month': '1',
    'opt_Year': '2014',
    'opt_Response': '1',
    'send': 'send'
}

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36'}

with requests.Session() as session:
    session.headers = headers

    # GET the page first so the session picks up any cookies it needs
    session.get(url)

    # POST the form and parse the resulting page
    response = session.post(url, data=payload)
    soup = BeautifulSoup(response.content, "html.parser")

    # The data sits in a JS variable inside the Ext.onReady script block
    script = soup.find('script', text=re.compile(r'Ext\.onReady')).text
    data = literal_eval(re.search(r"var myData = (.*?);", script, re.MULTILINE).group(1))

    for row in data:
        print(row)
Prints:
['DE', '2014-01-01', 45424, 43537, 41773, 40716, 39945, 39014, 37282, 37573, 38225, 40639, 42884, 45332, 46285, 45671, 45293, 45840, 48863, 53721, 54607, 53691, 51219, 49701, 49099, 45850]
['DE', '2014-01-02', 42468, 40217, 39564, 39758, 41054, 43586, 48705, 54691, 58650, 61110, 62773, 64309, 64561, 63807, 62706, 61919, 63338, 66760, 66615, 64653, 60690, 57825, 55697, 51490]
['DE', '2014-01-03', 47538, 45125, 44358, 44748, 45815, 48024, 52151, 57564, 60767, 62425, 63654, 65152, 65273, 63591, 62195, 61722, 63311, 66785, 66668, 64317, 60460, 57727, 56084, 52332]
...
['DE', '2014-01-29', 57605, 55275, 54154, 54226, 55320, 58459, 66647, 73890, 75957, 75958, 76725, 77446, 76852, 76362, 75300, 74549, 73958, 77129, 78240, 76323, 71961, 68595, 66088, 61923]
['DE', '2014-01-30', 58207, 56235, 54953, 54873, 55861, 58952, 66756, 73747, 75479, 75507, 76249, 76763, 76013, 75291, 73975, 73267, 72717, 76181, 77765, 76038, 71807, 68369, 65580, 61414]
['DE', '2014-01-31', 57870, 55665, 54381, 54422, 55419, 58490, 65929, 72706, 74666, 74392, 74791, 74923, 73877, 72205, 70449, 69596, 69345, 73259, 74950, 72959, 68623, 65319, 63414, 59467]
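Each row is the country code, a date, and then 24 hourly values, so the result maps naturally onto a CSV file. Here is a minimal sketch of that step (the `consumption.csv` filename and the `h00`..`h23` column names are my own choices, not part of the site's format):

```python
import csv

# Example rows in the same shape the scraper's `data` variable holds:
# country code, date string, then 24 hourly consumption values.
data = [
    ['DE', '2014-01-01'] + list(range(24)),
    ['DE', '2014-01-02'] + list(range(24)),
]

# One column per hour of the day, labelled h00..h23.
header = ['country', 'date'] + ['h%02d' % h for h in range(24)]

with open('consumption.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(data)
```

In the real script you would pass the scraped `data` list instead of the example rows.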
A Selenium-based approach would be less "magical", but I think this is enough to get you started (and is appropriate for a question showing minimal research effort).
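Since you need every month from 2012 through 2014, you can wrap the same POST in a loop over (year, month) pairs. A sketch of just the payload generation, under the assumption that the form fields behave as in the single-month example above (the `month_payloads` helper is my own name, not an API):

```python
def month_payloads(start_year=2012, end_year=2014, country='12'):
    """Yield one form payload per month in the requested range.

    country='12' is the form's code for Germany (DE), as in the
    example above.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield {
                'opt_period': '0',
                'opt_Country': country,
                'opt_Month': str(month),
                'opt_Year': str(year),
                'opt_Response': '1',
                'send': 'send',
            }

# Each payload would be passed to session.post(url, data=payload)
# exactly as in the single-month example, collecting the rows as you go.
payloads = list(month_payloads())
```

Be polite about request rates if you fetch all 36 months in one run.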