How do I capture the HTTP error code from a download.file request?
This code tries to download a page that does not exist:
url <- "https://en.wikipedia.org/asdfasdfasdf"
status_code <- download.file(url, destfile = "output.html", method = "libcurl")
This returns a 404 error:
trying URL 'https://en.wikipedia.org/asdfasdfasdf'
Error in download.file(url, destfile = "output.html", method = "libcurl") :
cannot open URL 'https://en.wikipedia.org/asdfasdfasdf'
In addition: Warning message:
In download.file(url, destfile = "output.html", method = "libcurl") :
cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'
However, the status_code variable still contains 0, even though the documentation for download.file states that the return value is:
An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.
The result is the same if I use curl or wget as the download method. What am I missing here? Is calling warnings() and parsing the output the only option?
I have seen other questions about using download.file, but none (that I could find) actually retrieve the HTTP status code.
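Probably the best option is to use the cURL library directly rather than through the download.file wrapper, which does not expose the full functionality of cURL. We can do this, for example, with the RCurl package (although other packages such as httr, or system calls, can achieve the same thing). Using cURL directly gives you access to the cURL info, including the response code. For example: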
library(RCurl)
curl = getCurlHandle()
x = getURL("https://en.wikipedia.org/asdfasdfasdf", curl = curl)
write(x, 'output.html')
getCurlInfo(curl)$response.code
# [1] 404
Although the first option above is cleaner, if you really want to use download.file, one possible approach is to capture the warning with withCallingHandlers:
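try(withCallingHandlers(
  download.file(url, destfile = "output.html", method = "libcurl"),
  warning = function(w) {
    # conditionMessage() extracts the warning text from the condition object,
    # so sub() works on the message only and not on the stored call
    my.warning <<- sub(".+HTTP status was ", "", conditionMessage(w))
  }),
  silent = TRUE)
cat(my.warning)
#> '404 Not Found'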
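The handler above leaves the status as a quoted string in a global variable. As a minimal sketch, the parsing step could be isolated into a small helper that returns the numeric code. `extract_http_status` is a hypothetical name, and the sketch assumes the warning text has the same "HTTP status was '...'" format shown in the question, which may differ between R versions and download methods.

```r
# Hypothetical helper: pull the numeric HTTP status out of a download.file()
# warning message. Assumes the message contains "HTTP status was '<code> <reason>'".
extract_http_status <- function(msg) {
  m <- regmatches(msg, regexec("HTTP status was '([0-9]{3})", msg))[[1]]
  if (length(m) < 2) return(NA_integer_)  # no status found in the message
  as.integer(m[2])
}

msg <- "cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'"
extract_http_status(msg)
#> [1] 404
extract_http_status("some unrelated warning")
#> [1] NA
```

Inside the warning handler, this could be called as extract_http_status(conditionMessage(w)) so that the stored value is an integer rather than a quoted string.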
If you don't mind using a different approach, you can try GET from the httr package:
url_200 <- "https://en.wikipedia.org/wiki/R_(programming_language)"
url_404 <- "https://en.wikipedia.org/asdfasdfasdf"
# OK
raw_200 <- httr::GET(url_200)
raw_200$status_code
#> [1] 200
# Not found
raw_404 <- httr::GET(url_404)
raw_404$status_code
#> [1] 404
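Since the original goal was to save the page to a file, the GET call can also write the response body to disk while still exposing the status code. The following is a rough sketch, not a drop-in replacement for download.file: `download_with_status` is a hypothetical helper name, and deleting the file on an error response is an assumed design choice.

```r
# Sketch: download to a file with httr and return the HTTP status code.
# `download_with_status` is a hypothetical helper name, not part of httr.
download_with_status <- function(url, destfile) {
  # write_disk() streams the response body to destfile instead of memory
  resp <- httr::GET(url, httr::write_disk(destfile, overwrite = TRUE))
  if (httr::http_error(resp)) {
    # For 4xx/5xx responses the file would hold the server's error page,
    # so remove it (assumed to be the desired behavior)
    unlink(destfile)
  }
  httr::status_code(resp)
}
```

Calling download_with_status(url_404, "output.html") would be expected to return the same 404 as above, while a successful request leaves the page in output.html.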
Created on 2019-01-02 by the reprex package (v0.2.1)