Automatically scraping data from a web page that e-mails a download link, using R
I am trying to download data from this site: https://mrcc.illinois.edu/cliwatch/northAmerPcpn/getArchive.jsp
My end goal is a script that I can schedule to run daily, grabbing yesterday's actual precipitation data as a CSV for Canada and North America. That means selecting the following options (in order): Actual, Comma-delimited, MPE, and Canada & North America, then setting the start and end dates to the previous day's date.
I managed to write the following script, which navigates those selections for me (I worked the parameters out by inspecting the URL the form generates after making the selections):
library(lubridate)

yesterday <- Sys.Date() - 1
yesterday_year  <- lubridate::year(yesterday)
yesterday_month <- lubridate::month(yesterday)
yesterday_day   <- lubridate::day(yesterday)

# Base URL with the fixed form options; the date parameters are appended below
mrcc.site <- 'https://mrcc.illinois.edu/cliwatch/northAmerPcpn/getArchive2.jsp?datatype=actual&dataformat=csv&dataset=mpe&reg=northAmer&syr='
mrcc_smo <- '&smo='
mrcc_sdy <- '&sdy='
mrcc_eyr <- '&eyr='
mrcc_emo <- '&emo='
mrcc_edy <- '&edy='
email    <- '&email=myemail%40gmail.com'

# Start and end date are both yesterday
download.url <- paste0(mrcc.site, yesterday_year,
                       mrcc_smo, yesterday_month,
                       mrcc_sdy, yesterday_day,
                       mrcc_eyr, yesterday_year,
                       mrcc_emo, yesterday_month,
                       mrcc_edy, yesterday_day,
                       email)

browseURL(download.url)
That last call opens the generated link.
My problem now is that the site is set up to e-mail you a download link for a .tar.gz file, which is really inconvenient for me. I would like my script to download this file to my computer automatically, without having to go into my e-mail and click the link manually. Is there a way to scrape the link to the generated download file, perhaps from the page itself rather than from my e-mail?
Thanks in advance for your help!
OK, I think you can do it this way and bypass the e-mail...
Basically, when you request the file, the response carries a timestamp, and that timestamp is what's used to build the download link. Give it a try and let me know if it works...
library(lubridate)
library(httr)

start_year  <- lubridate::year(Sys.Date() - 100)
start_month <- lubridate::month(Sys.Date() - 100)
start_day   <- lubridate::day(Sys.Date() - 100)
end_year    <- lubridate::year(Sys.Date())
end_month   <- lubridate::month(Sys.Date())
end_day     <- lubridate::day(Sys.Date())

mrcc.site <- 'https://mrcc.illinois.edu/cliwatch/northAmerPcpn/getArchive2.jsp'
query <- list(datatype   = "actual",
              dataformat = "csv",
              dataset    = "mpe",
              reg        = "northAmer",
              syr        = start_year,
              smo        = start_month,
              sdy        = start_day,
              eyr        = end_year,
              emo        = end_month,
              edy        = end_day,
              email      = "a@a.com")

# Request that the file be generated
response <- GET(mrcc.site, query = query)

# Build the download file URL from the response timestamp (server local time)
response_date <- format(with_tz(response$date, tzone = "America/Chicago"),
                        "%Y%m%d%H%M%S")
file_url <- paste0("http://mrcc.illinois.edu/cliwatch/northAmerPcpn/dataRetr/data",
                   response_date, ".tar.gz")

# Wait some time for the file to be generated...
# Content-Length arrives as a string, so convert it before comparing
file_size <- as.numeric(headers(HEAD(file_url))[["Content-Length"]])
if (file_size > 113 && file_size != 1126) {
  # bigger than 113 bytes (so it has been generated), and not 1126 (no file exists)
  download.file(file_url, destfile = "c:/tmp.tar.gz")
}
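If the archive takes a while to appear, the fixed "wait some time" step could be replaced by a small polling loop. A possible sketch of that idea (the helper names, retry count, and pause length are my own illustrative choices, not part of the site's behaviour):

```r
# Heuristic from the snippet above: the archive counts as ready when its
# Content-Length is above 113 bytes but not exactly 1126 (the "no file" page).
file_ready <- function(size) {
  size <- suppressWarnings(as.numeric(size))
  if (length(size) == 0 || is.na(size)) return(FALSE)
  size > 113 && size != 1126
}

# Poll the URL until the archive appears or we give up
# (tries and pause are illustrative values).
wait_for_file <- function(file_url, tries = 10, pause = 30) {
  for (i in seq_len(tries)) {
    size <- httr::headers(httr::HEAD(file_url))[["Content-Length"]]
    if (file_ready(size)) return(TRUE)
    Sys.sleep(pause)
  }
  FALSE
}

# if (wait_for_file(file_url)) download.file(file_url, destfile = "c:/tmp.tar.gz")
```

The download call is left commented out so the sketch can be sourced without hitting the server.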