Python requests for the URL-based procedure to automatically download data in bulk from the Climate Website (http://www.climate.weather.gc.ca)
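I am trying to build a program that downloads a .csv and puts it into a pandas DataFrame. The instructions suggest using wget on Linux, but when I use something like 'http.ID={a}/.data'.format(a)
to insert different weather stations from a dictionary I made of all the stations I have to monitor, it does not work properly. Here is the Readme from the Government of Canada website.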
-------------------------------------------- --------------------------
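Readme.txt
URL based procedure to automatically download data in bulk from Climate Website
(http://www.climate.weather.gc.ca)
Version: 2016-05-10
Environment and Climate Change Canada
To read this file online, please visit:
ftp://client_climate@ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/
Folder: Get_More_Data_Plus_de_donnees > Readme.txt
Instructions on how to download all of the weather data for one station from Environment and Climate Change Canada's Climate website:
The list of climate stations in the National Archive, updated daily, including their Climate ID, Station ID, WMO ID, TC ID, and coordinates, can be found in the following folder:
Get_More_Data_Plus_de_donnees > Station Inventory EN.csv
Use the following utilities to download data:
wget (GNU/Linux operating systems)
Cygwin (Windows operating systems) https://www.cygwin.com
Homebrew (OS X - Apple) http://brew.sh/
Example to download all available hourly data for Yellowknife A from 1998 to 2008 in .csv format
Command line: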
for year in `seq 1998 2008`;do for month in `seq 1 12`;do wget --content-disposition
"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=1706&Year=${year}&Month=${month}&Day=14&timeframe=1&submit= Download+Data" ;done;done
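WHERE;
year = change values in command line (seq 1998 2008)
month = change values in command line (seq 1 12)
format = [csv|xml]: the output format
timeframe = 1: for hourly data
timeframe = 2: for daily data
timeframe = 3: for monthly data
Day: the value of the "day" variable is not used and can be an arbitrary value
For another station, change the value of the variable stationID
For the data in XML format, change the value of the variable format to xml in the URL.
For information in French, change Download+Data to
++T%C3%A9l%C3%A9charger+%0D%0Ades+donn%C3%A9es, and also change _e to _f in the URL.
For questions or concerns, please contact our National Climate Services office: ec.services.climatiques-climate.services.ec@canada.ca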
-------------------------------------------- --------------------------
I originally used wget to download a single csv file from this link. It works without .format(ID,year)...
This works:
"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=50308&Year=2019&Month=3&Day=14&timeframe=2&submit= Download+Data"
But this does not:
"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2&submit= Download+Data".format(ID,year)
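I need to be able to insert different years and station IDs.
This does not work: no matter what the ID is, I still get the same weather.
It produces a result, but not for the weather station with ID 50308.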
ID = '50308'
year = '2019'
!wget -O Weather.csv"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2&submit= Download+Data".format(ID,year)
df = pd.read_csv('Weather.csv',skiprows = 24)
I tried replacing the statement above with the following:
import pandas as pd
import io
import requests
ID = '49088'
year = '2019'
url="http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2&submit= Download+Data".format(ID,year)
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
This is the error it spits out:
ParserError: Error tokenizing data. C error: Expected 2 fields in line 26, saw 27
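I would like to create a dictionary of weather station names and IDs, so that I can write a function and iterate over the dictionary with it, downloading each file and putting it into a pandas DataFrame.
The requests call fetches the .csv just fine; the error is that pandas cannot read the csv correctly. The downloaded file begins with lines that have only two fields, plus a blank line, before the actual data. Maybe you don't need to get that preamble into pandas: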
"Station Name","DELTA BURNS BOG"
"Province","BRITISH COLUMBIA"
"Current Station Operator","Environment and Climate Change Canada - Meteorological Service of Canada"
"Latitude","49.13"
"Longitude","-123.00"
"Elevation","3.10"
.. etc ...
and so on for the first 24 lines, then a blank line, and the rest is your data:
"Date/Time","Year","Month","Day","Data Quality","Max Temp (°C)","Max Temp Flag","Min Temp (°C)","Min Temp Flag","Mean Temp (°C)","Mean Temp Flag","Heat Deg Days (°C)","Heat Deg Days Flag","Cool Deg Days (°C)","Cool Deg Days Flag","Total Rain (mm)","Total Rain Flag","Total Snow (cm)","Total Snow Flag","Total Precip (mm)","Total Precip Flag","Snow on Grnd (cm)","Snow on Grnd Flag","Dir of Max Gust (10s deg)","Dir of Max Gust Flag","Spd of Max Gust (km/h)","Spd of Max Gust Flag"
"2019-01-01","2019","01","01","","5.3","","-0.6","","2.4","","15.6","","0.0","","","","","M","0.0","","","","","","",""
"2019-01-02","2019","01","02","","5.2","","0.6","","2.9","","15.1","","0.0","","","","","M","3.4","","","","","","",""
"2019-01-03","2019","01","03","","9.1","","3.4","","6.2","","11.8","","0.0","","","","","M","61.0","","","","","","",""
...
So if you tell pandas to skip the first 25(?) rows, you should avoid the parsing problem:
h=pd.read_csv(io.StringIO(s.decode('utf-8')), skiprows = 25)
But then again, maybe you really need those lines.
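If the leading lines do matter, one option is to split the download at the table header instead of hard-coding a row count. The following is only a minimal sketch, assuming the layout shown above: two-field "key","value" lines first, then a header line that starts with "Date/Time". The function name split_climate_csv and the header_prefix parameter are made up for illustration and are not part of pandas or the website.

```python
import csv
import io

import pandas as pd


def split_climate_csv(text, header_prefix='"Date/Time"'):
    """Split a downloaded file into (metadata dict, DataFrame).

    Assumes the table header is the first line starting with header_prefix.
    """
    lines = text.splitlines()
    # Locate the header row of the data table.
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith(header_prefix)),
        None,
    )
    if header_idx is None:
        raise ValueError("no table header found in the downloaded text")

    # Lines before the header are treated as "key","value" metadata;
    # blank lines and one-field lines are ignored.
    metadata = {}
    for row in csv.reader(lines[:header_idx]):
        if len(row) >= 2:
            metadata[row[0]] = row[1]

    # Everything from the header onward is a regular CSV table.
    table = pd.read_csv(io.StringIO("\n".join(lines[header_idx:])))
    return metadata, table
```

With the variable s from the question, this could be called as metadata, df = split_climate_csv(s.decode('utf-8-sig')). The 'utf-8-sig' codec is used here only so that a byte-order mark at the start of the file, if there is one, does not end up in the first key.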
(I don't really know pandas, so hopefully a smarter answer will show up soon.)
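For the dictionary of stations mentioned in the question, something along these lines might work. It is a rough sketch, not a tested client for the site: build_url, download_all and the station names are hypothetical, and it assumes the server accepts submit=Download+Data without the leading space that appears in the Readme. The query string is built with urllib.parse.urlencode rather than .format, so every value is filled in by Python before any request is made. In the notebook version, the line after ! is handed to the shell, so the .format(ID,year) call there was presumably never evaluated as Python.

```python
from urllib.parse import urlencode

BASE_URL = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html"


def build_url(station_id, year, month=3, timeframe=2):
    """Build the bulk-download URL for one station and year."""
    params = {
        "format": "csv",
        "stationID": station_id,
        "Year": year,
        "Month": month,
        "Day": 14,  # per the Readme, this value is not used
        "timeframe": timeframe,  # 1 = hourly, 2 = daily, 3 = monthly
        "submit": "Download Data",  # urlencode turns the space into '+'
    }
    return BASE_URL + "?" + urlencode(params)


def download_all(stations, year, fetch=None):
    """Return {station name: raw CSV text} for every entry in stations."""
    if fetch is None:
        import requests  # imported lazily so build_url works without it

        def fetch(url):
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content.decode("utf-8-sig")

    return {name: fetch(build_url(sid, year)) for name, sid in stations.items()}


# Hypothetical dictionary: the names are placeholders, the IDs are the two
# station IDs used in the question.
stations = {"station_a": 50308, "station_b": 49088}
# raw = download_all(stations, 2019)  # performs one HTTP request per station
```

Each value in the returned dictionary is still raw text, so it would need to go through pd.read_csv with a suitable skiprows (or similar handling of the leading lines) before it becomes a DataFrame.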