Using URLConnection.getInputStream() to get source code (amazon.de)

Question

When I want to get the source code of a specific web page, I use the following code:

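import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("https://google.de");
URLConnection urlConnect = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnect.getInputStream())); // Here is the error with the amazon url
StringBuilder sb = new StringBuilder(); // no synchronization needed, so StringBuilder instead of StringBuffer
String line, htmlData;
while ((line = br.readLine()) != null) {
    sb.append(line).append("\n");
}
br.close(); // release the underlying stream
htmlData = sb.toString();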

The code above works fine, but when the URL is changed to...

URL url = new URL("https://amazon.de");

...you sometimes get an IOException -> server error code 503. To me this makes no sense, because I can open the Amazon page in a browser without any error.

Answer 1

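Requesting https://amazon.de with curl -v https://amazon.de returns a redirect in the response (when you follow the redirect, you get a 503 from the target location https://www.amazon.de/). The body contains the following comment: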

To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.de/ref=rm_5_sv, or our Product Advertising API at https://partnernet.amazon.de/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
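The same check can be made from Java. With HttpURLConnection, getInputStream() throws an IOException for a 503, but the status code and the response body are still available through getResponseCode() and getErrorStream(). The following is only a minimal sketch: the class name ResponseInspector and the helper readAll are made up for illustration, and the body is assumed to be UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
class ResponseInspector {

    // Reads a stream fully as text (UTF-8 is an assumption) and
    // terminates every line with "\n".
    static String readAll(InputStream in) throws IOException {
        if (in == null) {
            return ""; // getErrorStream() returns null when there is no error body
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ResponseInspector <url>");
            return;
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
        int status = conn.getResponseCode();
        System.out.println("HTTP status: " + status);
        // For 4xx/5xx responses the body is on the error stream;
        // calling getInputStream() here would throw the IOException.
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        System.out.println(readAll(body));
        conn.disconnect();
    }
}
```

Running it with https://amazon.de as the argument prints the status code and whatever body the server sent, which should make it possible to see a comment like the one quoted above when the 503 occurs.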

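I assume Amazon returns this response when it detects that your request comes from a non-browser context (i.e. by parsing the User-Agent), in order to nudge you toward using the APIs instead of scraping the site directly.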
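If the User-Agent is indeed what triggers the response, the header can be set explicitly with setRequestProperty before the connection is opened; by default Java sends a value of the form Java/<version>. This is only a sketch of that idea: the class name UserAgentFetch, the USER_AGENT value, and the timeouts are assumptions chosen for illustration, and there is no guarantee that Amazon accepts the request with any particular header. For regular automated access, the APIs named in the comment remain the intended route.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of the original answer.
class UserAgentFetch {

    // Example value only (an assumption); substitute a string that
    // honestly identifies your client.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    // Prepares a connection with an explicit User-Agent header.
    // Nothing is sent over the network until a method such as
    // getResponseCode() or getInputStream() is called.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);    // milliseconds, arbitrary choice
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UserAgentFetch <url>");
            return;
        }
        HttpURLConnection conn = open(args[0]);
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

Comparing the status printed here with the status from a request that leaves the header at its default is one way to test the assumption about User-Agent detection.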

java

urlconnection