org.jsoup.HttpStatusException: HTTP error fetching URL. Status=504 Error while trying to scrape HTML content

I want to scrape the HTML from the URL listed below. The problem is that I get this error:

Aug 14, 2016 6:40:36 PM booksscraper.BooksScraper main
SEVERE: null
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=504, URL=http://www.bkstr.com/webapp/wcs/stores/servlet/CourseMaterialsResultsView?catalogId=10001&categoryId=9604&storeId=10293&langId=-1&programId=636&termId=100043741&divisionDisplayName=%20&departmentDisplayName=ACCG&courseDisplayName=16971&sectionDisplayName=P15%20DAVIS&demoKey=d&purpose=browse
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
    at booksscraper.BooksScraper.main(BooksScraper.java:52)

I have set the timeout to infinite, but that did not help. The page's HTML is very large (14,833 lines). Could that be the cause of the problem?

String url = "http://www.bkstr.com/webapp/wcs/stores/servlet/CourseMaterialsResultsView?catalogId=10001&categoryId=9604&storeId=10293&langId=-1&programId=636&termId=100043741&divisionDisplayName=%20&departmentDisplayName=ACCG&courseDisplayName=16971&sectionDisplayName=P15%20DAVIS&demoKey=d&purpose=browse";

Document doc = Jsoup.connect(url)
                .maxBodySize(0)   // 0 = no limit on the response body size
                .timeout(0)       // 0 = wait indefinitely (no client-side timeout)
                .get();

System.out.println(doc);

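This is not a problem with the Jsoup API or with your code. The error occurs because the URL is not responding in time and the server returns a "Gateway Timeout" error (the proxy server did not receive a timely response from the upstream server).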

The exception message from your program:

HTTP error fetching URL. Status=504

HTTP status code: 504

504 Gateway Timeout

The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed to access in attempting to complete the request.

  Note: Note to implementors: some deployed proxies are known to
  return 400 or 500 when DNS lookups time out.
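
Since a 504 is generated on the server side, about all a client can do is wait and try again. A minimal sketch of that idea follows, using only the Java standard library. The names GatewayTimeoutRetry, StatusException, isRetryable and withRetry are hypothetical and not part of Jsoup; StatusException stands in for org.jsoup.HttpStatusException, whose getStatusCode() would supply the status in a real scraper. Retrying on 502 and 503 as well as 504 is an assumption, not something the original post states.

```java
import java.util.function.Supplier;

public class GatewayTimeoutRetry {

    /** Hypothetical stand-in for org.jsoup.HttpStatusException; carries the HTTP status. */
    static class StatusException extends RuntimeException {
        final int status;

        StatusException(int status) {
            super("HTTP error fetching URL. Status=" + status);
            this.status = status;
        }
    }

    /** Assumption: only gateway-style errors are worth retrying. */
    static boolean isRetryable(int status) {
        return status == 502 || status == 503 || status == 504;
    }

    /**
     * Calls fetch until it succeeds, a non-retryable status is seen,
     * or maxAttempts is reached. The delay doubles after each failure.
     */
    static <T> T withRetry(Supplier<T> fetch, int maxAttempts, long initialDelayMs) {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return fetch.get();
            } catch (StatusException e) {
                if (!isRetryable(e.status) || attempt >= maxAttempts) {
                    throw e; // give up and surface the last error
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) {
        // Simulated fetch: fails twice with 504, then succeeds.
        int[] calls = {0};
        String html = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new StatusException(504);
            }
            return "<html>...</html>";
        }, 5, 100);
        System.out.println("Fetched after " + calls[0] + " attempts: " + html);
    }
}
```

In a real scraper the lambda would wrap the Jsoup call and translate its HttpStatusException into the retry decision. A 4xx status is rethrown immediately, because repeating the same request is unlikely to change the outcome.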

I got it working by setting the UserAgent to Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36. However, it takes about 4 minutes to get a response.
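
A rough sketch of that workaround with Jsoup is shown here, assuming Jsoup is on the classpath. The class name BookPageFetcher, the prepare helper and the 5-minute timeout are assumptions made for illustration; the original code used timeout(0), which waits indefinitely. The timeout here is simply chosen to exceed the roughly 4-minute response time.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BookPageFetcher {

    // Browser-like user agent string quoted in the post.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";

    // Assumption: 5 minutes, a little more than the ~4 minutes the server needed.
    static final int TIMEOUT_MS = 5 * 60 * 1000;

    /** Builds the connection without sending the request yet. */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .maxBodySize(0)        // 0 = no limit on the response body size
                .timeout(TIMEOUT_MS);  // milliseconds
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java BookPageFetcher <url>");
            return;
        }
        Document doc = prepare(args[0]).get(); // throws HttpStatusException on 4xx/5xx
        System.out.println(doc.title());
    }
}
```

Passing the URL as a command-line argument keeps the long query string out of the source; quote it in the shell so that the & characters are not interpreted.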