Not able to download a specific URL in Java

Question

I wrote the following program to download a URL using Apache Commons IO, but it fails with a read timeout.

Exception

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at sun.security.ssl.InputRecord.readFully(Unknown Source)
at sun.security.ssl.InputRecord.read(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readDataRecord(Unknown Source)
at sun.security.ssl.AppInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
at org.apache.commons.io.FileUtils.copyURLToFile(FileUtils.java:1456)
at com.touseef.stock.FileDownload.main(FileDownload.java:23)

Program

    String urlStr = "https://www.nseindia.com/";
    // Backslashes must be escaped inside a Java string literal.
    File file = new File("C:\\User\\WorkSpace\\Output.txt");
    URL url;
    try {
        url = new URL(urlStr);
        // This call opens the URL with the JDK's default connection settings.
        FileUtils.copyURLToFile(url, file);
        System.out.println("Successfully Completed.");
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

Other sites download fine with the same code. Please advise. I am using the commons-io-2.6 jar.

Answer 1

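This site appears to be protected by some kind of web gateway (a DoS protection service such as Akamai?). Clients seem to be fingerprinted by their TLS connection and by their HTTP request headers, and only a real web browser is allowed to connect.

The following code uses Apache HttpClient 4.5 and works, at least for now: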

    String urlStr = "https://www.nseindia.com/";
    // Backslashes must be escaped inside a Java string literal.
    File file = new File("C:\\User\\WorkSpace\\Output.txt");
    String userAgent = "-";

    // Replace the default User-Agent, which identifies the client as a Java library.
    CloseableHttpClient httpclient = HttpClients.custom().setUserAgent(userAgent).build();
    HttpGet httpget = new HttpGet(urlStr);
    httpget.addHeader("Accept-Language", "en-US");
    httpget.addHeader("Cookie", "");

    System.out.println("Executing request " + httpget.getRequestLine());
    try (CloseableHttpResponse response = httpclient.execute(httpget)) {
        System.out.println("----------------------------------------");
        System.out.println(response.getStatusLine());
        String body = EntityUtils.toString(response.getEntity());
        System.out.println(body);
        // Files.writeString requires Java 11 or later.
        Files.writeString(file.toPath(), body);
    }

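A request that works in Firefox, for example, does not work from Java, because the TLS connection differs in its protocols and ciphers. I tried several other combinations with Apache HttpClient, and those failed as well, even when sending the same request as seen in Fiddler.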
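To see what the JVM side of that TLS difference looks like, the following minimal sketch lists the protocols and cipher suites that the default `SSLContext` enables for new connections. The class name `TlsDefaultsSketch` and its helper methods are made up for illustration. How the site actually fingerprints clients is not known, so this only shows what a Java client would offer by default, which can then be compared with what a browser offers.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Hypothetical helper class: prints the JVM's default TLS settings.
public class TlsDefaultsSketch {

    // Returns the default SSL parameters of the JVM-wide default context.
    private static SSLParameters defaults() {
        try {
            return SSLContext.getDefault().getDefaultSSLParameters();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    // Protocol versions enabled by default, for example TLSv1.2 or TLSv1.3.
    public static List<String> defaultProtocols() {
        return Arrays.asList(defaults().getProtocols());
    }

    // Cipher suites enabled by default, in the JVM's preference order.
    public static List<String> defaultCipherSuites() {
        return Arrays.asList(defaults().getCipherSuites());
    }

    public static void main(String[] args) {
        System.out.println("Protocols: " + defaultProtocols());
        System.out.println("Cipher suites:");
        for (String suite : defaultCipherSuites()) {
            System.out.println("  " + suite);
        }
    }
}
```

The output depends on the Java version and on the security configuration of the installation, so it is best run on the same JVM that performs the download.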

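Using this website from Java is therefore very difficult. Even though the code above works now, the protection system can be adjusted at any time so that it stops working.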
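For readers who would rather avoid the extra dependency, the same request can be sketched with the JDK's built-in `java.net.http.HttpClient` (Java 11 or later). This is an assumption, not a confirmed workaround: it has not been verified against this site, and the JDK client negotiates TLS and HTTP/2 differently from Apache HttpClient 4.5, so the gateway may treat it differently. The class name `JdkDownloadSketch`, the timeout values, and the command-line arguments are illustrative choices. The sketch sends only the `User-Agent` and `Accept-Language` headers and sets explicit timeouts so that a blocked request fails quickly.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical JDK-only variant of the download; not verified against the site.
public class JdkDownloadSketch {

    // Builds a GET request with the User-Agent and Accept-Language values used above.
    public static HttpRequest buildRequest(String urlStr) {
        return HttpRequest.newBuilder(URI.create(urlStr))
                .timeout(Duration.ofSeconds(30)) // assumed limit for receiving the response
                .header("User-Agent", "-")
                .header("Accept-Language", "en-US")
                .GET()
                .build();
    }

    // Sends the request, writes the response body to target, and returns the status code.
    public static int download(String urlStr, Path target)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // assumed connect limit
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<Path> response =
                client.send(buildRequest(urlStr), HttpResponse.BodyHandlers.ofFile(target));
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: JdkDownloadSketch <url> <output-file>");
            return;
        }
        int status = download(args[0], Path.of(args[1]));
        System.out.println("HTTP status: " + status);
    }
}
```

It could be run as `java JdkDownloadSketch.java https://www.nseindia.com/ Output.txt`. The body is written to the file whatever the status code is, so the printed status should be checked before trusting the file contents.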

I assume a site like this offers an API intended for use by programs. Contact them and ask; that is the only advice I can give you.

