Cannot download file in R - status 503
I am trying to download a file:
> URL <- "https://www.bitmarket.pl/graphs/BTCPLN/90m.json"
> download.file(URL, destfile = "res.json", method = "curl")
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4676 0 4676 0 0 56930 0 --:--:-- --:--:-- --:--:-- 57024
But it returns a 503 status. The full output:
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
<title>Just a moment...</title>
<style type="text/css">
html, body {width: 100%; height: 100%; margin: 0; padding: 0;}
body {background-color: #ffffff; font-family: Helvetica, Arial, sans-serif; font-size: 100%;}
h1 {font-size: 1.5em; color: #404040; text-align: center;}
p {font-size: 1em; color: #404040; text-align: center; margin: 10px 0 0 0;}
#spinner {margin: 0 auto 30px auto; display: block;}
.attribution {margin-top: 20px;}
@-webkit-keyframes bubbles { 33%: { -webkit-transform: translateY(10px); transform: translateY(10px); } 66% { -webkit-transform: translateY(-10px); transform: translateY(-10px); } 100% { -webkit-transform: translateY(0); transform: translateY(0); } }
@keyframes bubbles { 33%: { -webkit-transform: translateY(10px); transform: translateY(10px); } 66% { -webkit-transform: translateY(-10px); transform: translateY(-10px); } 100% { -webkit-transform: translateY(0); transform: translateY(0); } }
.bubbles { background-color: #404040; width:15px; height: 15px; margin:2px; border-radius:100%; -webkit-animation:bubbles 0.6s 0.07s infinite ease-in-out; animation:bubbles 0.6s 0.07s infinite ease-in-out; -webkit-animation-fill-mode:both; animation-fill-mode:both; display:inline-block; }
</style>
<script type="text/javascript">
//<![CDATA[
(function(){
var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },
b = function(b, c) {a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)};
b(function(){
var a = document.getElementById('cf-content');a.style.display = 'block';
setTimeout(function(){
var s,t,o,p,b,r,e,a,k,i,n,g,f, eoQNdpG={"GwwAAtfX":+((+!![]+[])+(+!![]))};
t = document.createElement('div');
t.innerHTML="<a href='/'>x</a>";
t = t.firstChild.href;r = t.match(/https?:\/\//)[0];
t = t.substr(r.length); t = t.substr(0,t.length-1);
a = document.getElementById('jschl-answer');
f = document.getElementById('challenge-form');
;eoQNdpG.GwwAAtfX+=+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]));eoQNdpG.GwwAAtfX*=+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));eoQNdpG.GwwAAtfX-=+((!+[]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));eoQNdpG.GwwAAtfX-=+((+!![]+[])+(+[]));eoQNdpG.GwwAAtfX-=+((+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));eoQNdpG.GwwAAtfX+=+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));eoQNdpG.GwwAAtfX+=+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));eoQNdpG.GwwAAtfX*=+((!+[]+!![]+!![]+[])+(!+[]+!![]+!![]));a.value = parseInt(eoQNdpG.GwwAAtfX, 10) + t.length; '; 121'
f.action += location.hash;
f.submit();
}, 4000);
}, false);
})();
//]]>
</script>
</head>
<body>
<table width="100%" height="100%" cellpadding="20">
<tr>
<td align="center" valign="middle">
<div class="cf-browser-verification cf-im-under-attack">
<noscript><h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1></noscript>
<div id="cf-content" style="display:none">
<div>
<div class="bubbles"></div>
<div class="bubbles"></div>
<div class="bubbles"></div>
</div>
<h1><span data-translate="checking_browser">Checking your browser before accessing</span> bitmarket.pl.</h1>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs">Please allow up to 5 seconds…</p>
</div>
<form id="challenge-form" action="/cdn-cgi/l/chk_jschl" method="get">
<input type="hidden" name="jschl_vc" value="51a7cb71596dbf54fdd307c1e65de941"/>
<input type="hidden" name="pass" value="1512824604.589-Uwtm9TfzWe"/>
<input type="hidden" id="jschl-answer" name="jschl_answer"/>
</form>
</div>
<div class="attribution">
<a href="https://www.cloudflare.com/5xx-error-landing?utm_source=iuam" target="_blank" style="font-size: 12px;">DDoS protection by Cloudflare</a>
<br>
Ray ID: 3ca829f9aed06afb
</div>
</td>
</tr>
</table>
</body>
</html>
wget does not work either:
--2017-12-09 14:01:29-- https://www.bitmarket.pl/graphs/BTCPLN/90m.json
Resolving www.bitmarket.pl... 104.20.67.184, 104.20.68.184
Connecting to www.bitmarket.pl|104.20.67.184|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2017-12-09 14:01:29 ERROR 503: Service Temporarily Unavailable.
But when you open this link in a web browser: https://www.bitmarket.pl/graphs/BTCPLN/90m.json the browser returns the correct JSON file. Any idea why it doesn't work?
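That is because the page uses a DDoS protection service (Cloudflare, as the attribution in the returned HTML shows). On first load, the page itself performs a JavaScript-initiated redirect after 5 seconds to fetch the final content, so the process fails with tools such as wget/curl that do not interpret JavaScript. If you think it is legitimate to do so, one option is to use phantomjs and supply a custom script (e.g., save.js):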
var system = require('system');
var page = require('webpage').create();
page.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5';
page.open(system.args[1], function(){
setTimeout(function(){
console.log(page.evaluate(function(){
//gets the JSON from the first <pre> element rendered on the page
return document.getElementsByTagName('pre')[0].textContent;
}));
phantom.exit();
}, 6000); //waits 6 seconds for the page to reload
});
Then use it in place of wget as:
phantomjs save.js https://www.bitmarket.pl/graphs/BTCPLN/90m.json
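Before reaching for a headless browser, it can help to confirm that what came back really is the challenge page and not some other 503. A minimal sketch in plain Node-style JavaScript, assuming the interstitial always looks like the one in the question (HTTP 503 plus a form with id "challenge-form"); the helper name isChallengePage and the sample strings are made up for illustration and are not part of any library:

```javascript
// Hypothetical helper: decide whether a response is the "checking your browser"
// interstitial rather than the real content. Assumes the page shape shown in
// the question; Cloudflare may change it at any time.
function isChallengePage(status, body) {
  // The interstitial in the question comes back as HTTP 503 ...
  if (status !== 503) return false;
  // ... and carries the challenge form and the jschl fields.
  return body.includes('id="challenge-form"') && body.includes('jschl');
}

// Made-up sample bodies, trimmed down from the output in the question.
const challenge =
  '<form id="challenge-form" action="/cdn-cgi/l/chk_jschl" method="get">' +
  '<input type="hidden" id="jschl-answer" name="jschl_answer"/></form>';
const json = '[{"time":1512906360,"open":"48303.78770000"}]';

console.log(isChallengePage(503, challenge));                         // true
console.log(isChallengePage(200, json));                              // false
console.log(isChallengePage(503, 'Service Temporarily Unavailable')); // false
```

A plain 503 without the form is a genuine outage, so waiting and retrying is the better move there.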
There is no need to leave R. We can use the V8 package for this and create a special GET function:
#' Work around cloudflare anti-DDoS protection
#'
#' SUPER FRAGILE AS IT NEEDS TO BE MODIFIED WHENEVER CLOUDFLARE CHANGES THE CHALLENGE CODE
#'
#' @param cf_url the URL you want
#' @param ... other params passed to all `httr::GET` calls (headers, verbose, etc)
#' @return an `httr::response` object
cf_GET <- function(cf_url, ...) {
require(urltools)
require(stringi)
require(rvest)
library(httr)
require(V8)
c(
ua_macos_chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
ua_ios_safari = "Mozilla/5.0 (iPad; CPU OS 10_2 like Mac OS X) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0 Mobile/14C92 Safari/602.1",
ua_win7_firefox = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
) -> agents
# use a valid browser user-agent but don't always use the same one
(cf_agent <- unname(sample(agents, 1)))
httr::GET(
url = cf_url,
httr::user_agent(cf_agent),
...
) -> res
# sometimes you get lucky and the page comes back
if (httr::status_code(res) != 503) return(res) # return now if no cf redirect
# get the page
cf_pg <- httr::content(res, as="parsed")
# get form/form variables we'll need later
(jschl_vc <- html_attr(html_node(cf_pg, "input[name='jschl_vc']"), "value"))
(pass <- html_attr(html_node(cf_pg, "input[name='pass']"), "value"))
(action <- html_attr(html_node(cf_pg, "form[id='challenge-form']"), "action"))
# get the page as just lines of text
cf_code <- httr::content(res, as="text")
writeLines(cf_code, "/tmp/a.html")
cf_code <- stri_split_lines(cf_code)[[1]]
# find the javascript
decl <- cf_code[which(stri_detect_fixed(cf_code, "s,t,o,p,b"))]
(init_line <- stri_match_first_regex(decl, "s,t,o,p,b[[:alpha:], ]+ (.*$)")[,2])
(var_name <- stri_match_first_regex(init_line, "([[:alnum:]]+)")[,2])
(exec_line <- cf_code[which(stri_detect_fixed(cf_code, var_name))[2]])
# tweak and execute the javascript
ctx <- v8()
ctx$eval(sprintf("var a = {}; t = '%s';%s\n%s", domain(cf_url), decl, exec_line))
(ctx$get("a.value"))
# the challenge page asks for 5 seconds; waiting 10 is safer
message("Waiting 10 seconds...")
Sys.sleep(10)
# solve the DDoS challenge and make the request
httr::GET(
url = sprintf("%s://%s/%s", scheme(cf_url), domain(cf_url), action),
httr::user_agent(cf_agent),
httr::add_headers(
`Referer` = cf_url
),
query = list(
jschl_answer = ctx$get("a.value"), # the form field is named jschl_answer (jschl-answer is only its id)
jschl_vc = jschl_vc,
pass = pass
),
...
) -> res
res
}
And, it works:
res <- cf_GET("https://www.bitmarket.pl/graphs/BTCPLN/90m.json")
str(content(res, as="parsed"))
## List of 90
## $ :List of 6
## ..$ time : int 1512906360
## ..$ open : chr "48303.78770000"
## ..$ high : chr "48303.78770000"
## ..$ low : chr "48303.78770000"
## ..$ close: chr "48303.78770000"
## ..$ vol : chr "0.13550275"
## $ :List of 6
## ..$ time : int 1512906420
## ..$ open : chr "48303.78770000"
## ..$ high : chr "48303.78770000"
## ..$ low : chr "48000.10000000"
## ..$ close: chr "48000.10000000"
## ..$ vol : chr "1.12078334"
## ...
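For anyone wondering what the V8 step is actually evaluating: the challenge script spells numbers using only [] and !, then applies a chain of +=, -= and *= operations and adds the length of the host name. The sketch below, in plain JavaScript, shows the idea on two of the expressions from the page and on a made-up chain. The helpers solveChallenge and buildAnswerUrl, the host example.org and the field values are hypothetical, chosen only for illustration; this is not the code Cloudflare or cfhttr uses.

```javascript
// Building blocks used by the challenge:
//   !![] -> true, +!![] -> 1, +[] -> 0, (n + []) -> the string "n"
// so ("1" + 1) is the string "11", and the leading + turns it into 11.
const seed = +((+!![] + []) + (+!![])); // 11, the start value in the question
const ten = +((+!![] + []) + (+[]));    // 10, one of the operands in the question

// Hypothetical helper: replay a chain of [operator, operand] steps the way
// the page does, then add the host name length as the page's last line does.
function solveChallenge(start, steps, host) {
  let acc = start;
  for (const [op, n] of steps) {
    if (op === '+=') acc += n;
    else if (op === '-=') acc -= n;
    else if (op === '*=') acc *= n;
    else throw new Error('unexpected operator: ' + op);
  }
  // Mirrors: a.value = parseInt(value, 10) + t.length
  return parseInt(acc, 10) + host.length;
}

// Hypothetical helper: build the URL the hidden form would submit to.
function buildAnswerUrl(pageUrl, action, fields) {
  const url = new URL(action, pageUrl); // resolves "/cdn-cgi/..." against the origin
  for (const [name, value] of Object.entries(fields)) {
    url.searchParams.set(name, String(value));
  }
  return url.toString();
}

// Made-up chain: (11 + 10) * 2 = 42, plus "example.org".length (11) = 53.
const answer = solveChallenge(seed, [['+=', ten], ['*=', 2]], 'example.org');
console.log(seed, ten, answer); // 11 10 53

const target = buildAnswerUrl('https://example.org/data.json', '/cdn-cgi/l/chk_jschl', {
  jschl_vc: 'abc123',          // made-up value
  pass: '1512824604.589-xyz',  // made-up value
  jschl_answer: answer,
});
console.log(target);
```

Letting an engine such as V8 evaluate the page's own lines, as cf_GET does, avoids having to parse the obfuscated expressions by hand. That is also why it breaks whenever the challenge code changes.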
UPDATE:
I wrapped this up in a package:
devtools::install_github("hrbrmstr/cfhttr")
library(cfhttr)
res <- cf_GET("https://www.bitmarket.pl/graphs/BTCPLN/90m.json")
(same output)