Download a file in Java from a URL where 1) you don't know the extension (e.g. .jpg) or 2) the URL redirects to a file

Question

The problem is that although I know how to download a file from a URL such as:

http://i12.photobucket.com/albums/a206/zxc6/1_zps3e6rjofn.jpg

when it comes to a URL like this one:

https://images.duckduckgo.com/iu/?u=http%3......

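I don't know how to download it.

My code, which uses IOUtils to download the file, works fine when the extension is visible, but for the example above it throws: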

java.io.IOException: Server returned HTTP response code: 500 for URL: https://images.duckduckgo.com/iu/?u=http%3A%2F%2Fimages2.fanpop.com%2Fimage%2Fphotos%2F8900000%2FFirefox-firefox-8967915-1600-1200.jpg&f=1

The same error occurs even if I remove &f=1.

Code for Downloader (a prototype, for testing purposes):

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

import org.apache.commons.io.IOUtils;

public class Downloader {

    private static class ProgressListener implements ActionListener {

        @Override
        public void actionPerformed(ActionEvent e) {
            // e.getSource() is the DownloadProgressListener instance,
            // because it passes itself as the event source in the
            // overridden method afterWrite().
            System.out.println("Downloaded bytes : " + ((DownloadProgressListener) e.getSource()).getByteCount());
        }
    }

    /**
     * Main method.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        URL dl = null;
        File fl = null;
        OutputStream os = null;
        InputStream is = null;
        ProgressListener progressListener = new ProgressListener();
        try {
            // The backslash must be escaped inside a Java string literal.
            fl = new File(System.getProperty("user.home").replace("\\", "/") + "/Desktop/image.jpg");
            dl = new URL(
                "https://images.duckduckgo.com/iu/?u=http%3A%2F%2Fimages2.fanpop.com%2Fimage%2Fphotos%2F8900000%2FFirefox-firefox-8967915-1600-1200.jpg&f=1");
            os = new FileOutputStream(fl);
            is = dl.openStream();

            // http://i12.photobucket.com/albums/a206/zxc6/1_zps3e6rjofn.jpg

            DownloadProgressListener dcount = new DownloadProgressListener(os);
            dcount.setListener(progressListener);

            // Note: this opens a second connection, separate from the one
            // behind openStream() above; it is used only to read headers.
            URLConnection connection = dl.openConnection();

            // This header gives the total length of the source stream as a String.
            // You may want to convert it to an integer and store the value to
            // calculate the percentage of progress.
            System.out.println("Content Length:" + connection.getHeaderField("Content-Length"));
            System.out.println("Content Type:" + connection.getContentType());

            System.out.println("\n");

            // Begin the transfer by writing to dcount, not os.
            IOUtils.copy(is, dcount);

        } catch (Exception e) {
            System.out.println(e);
        } finally {
            IOUtils.closeQuietly(os);
            IOUtils.closeQuietly(is);
        }
    }
}

Code for DownloadProgressListener:

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.io.output.CountingOutputStream;

public class DownloadProgressListener extends CountingOutputStream {

    private ActionListener listener = null;

    public DownloadProgressListener(OutputStream out) {
        super(out);
    }

    public void setListener(ActionListener listener) {
        this.listener = listener;
    }

    @Override
    protected void afterWrite(int n) throws IOException {
        super.afterWrite(n);
        if (listener != null) {
            listener.actionPerformed(new ActionEvent(this, 0, null));
        }
    }

}

Questions I read before posting:

1) Download file from url that doesn't end with .extension

2) http://www.mkyong.com/java/how-to-get-url-content-in-java/

3) Download file using java apache commons?

4) How to download and save a file from Internet using Java?

5) How to create file object from URL object

Answer 1

As pointed out in the comments, the extension doesn't matter.
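If a local file name still needs an extension, one option is to derive it from the Content-Type header that the question's code already prints, rather than from the URL. The following is only a minimal sketch: the class name ExtensionGuess, the method extensionFor, the tiny mapping table, and the ".bin" fallback are all assumptions made for illustration, not part of the original code.

```java
import java.util.Locale;
import java.util.Map;

final class ExtensionGuess {

    // Hypothetical, deliberately tiny mapping; extend it as needed.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "image/jpeg", ".jpg",
            "image/png", ".png",
            "image/gif", ".gif");

    /** Returns a file extension for a Content-Type value, or ".bin" if it is unknown. */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return ".bin";
        }
        // Drop parameters such as "; charset=UTF-8" before the lookup.
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return EXTENSIONS.getOrDefault(mime, ".bin");
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("image/jpeg"));
        System.out.println(extensionFor("image/png; charset=binary"));
    }
}
```

The value passed in would typically come from connection.getContentType(), which may return null when the server sends no such header.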

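The problem here is trying to download something that may be a redirect, or may just be a parameter to an asynchronous call.

Your extra-long URL without an extension is broken, but I can suggest a potential solution for the other kind.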

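If you look at the URL:

https://images.duckduckgo.com/iu/?u=http%3A%2F%2Fimages2.fanpop.com%2Fimage%2Fphotos%2F8900000%2FFirefox-firefox-8967915-1600-1200.jpg&f=1

the URL of the image is actually right there. It is just encoded, and should be easy to decode. Java has a library class for this (java.net.URLDecoder), but if you want to do it yourself, look at it this way:

http%3A%2F%2Fimages2.fanpop.com%2Fimage%2Fphotos%2F8900000%2FFirefox-firefox-8967915-1600-1200.jpg&f=1

The encoded parts have the form %XX, where XX is two hexadecimal digits. If you look at an HTML URL encoding table, you will see that %3A is a colon and %2F is a forward slash.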
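To make the %XX rule concrete, here is a rough sketch of decoding by hand. The class and method names are made up for this illustration, and it assumes the input is plain ASCII, as an encoded URL normally is. In real code java.net.URLDecoder is the more sensible choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PercentDecode {

    /** Decodes %XX escapes by hand; assumes the input is ASCII. */
    static String decode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                // Each X is one hexadecimal digit; together they form one byte.
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two digits just consumed
            } else {
                out.write(c); // ASCII assumption: one char is one byte
            }
        }
        // The decoded bytes are interpreted as UTF-8.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("http%3A%2F%2Fimages2.fanpop.com%2Fimage"));
    }
}
```

Unlike URLDecoder, this sketch leaves a literal + untouched instead of turning it into a space.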

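If you replace all the encoded entities, you get: http://images2.fanpop.com/image/photos/8900000/Firefox-firefox-8967915-1600-1200.jpg&f=1

In this case you don't need the extra parameter, so you can discard &f=1 and download the image from the original URL. In most cases, I imagine you could keep the extra parameter and it would simply be ignored.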

--

In short:

1) Extract the original URL
2) Decode it
3) Download it
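The extract, decode, and download steps might look roughly like the sketch below. It assumes the wrapped URL always sits in a query parameter named u, as in the example above; the class and method names are invented for illustration; and URLDecoder.decode(String, Charset) requires Java 10 or later. It also assumes the decoded URL contains no characters, such as spaces, that URI.create would reject.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class WrappedUrlDownloader {

    /** Extract and decode: returns the decoded value of the "u" query parameter. */
    static String extractTarget(String wrapperUrl) {
        String rawQuery = URI.create(wrapperUrl).getRawQuery();
        if (rawQuery == null) {
            throw new IllegalArgumentException("No query string in " + wrapperUrl);
        }
        // Splitting the raw query first keeps "&f=1" out of the result,
        // because it is a parameter of the wrapper, not of the image URL.
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("u")) {
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        throw new IllegalArgumentException("No \"u\" parameter in " + wrapperUrl);
    }

    /** Download: copies the content at the given URL into a local file. */
    static void download(String url, String targetFile) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, Paths.get(targetFile), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        String wrapper = "https://images.duckduckgo.com/iu/?u=http%3A%2F%2Fimages2.fanpop.com"
                + "%2Fimage%2Fphotos%2F8900000%2FFirefox-firefox-8967915-1600-1200.jpg&f=1";
        String target = extractTarget(wrapper);
        System.out.println(target);
        // Only touch the network when a target file name is passed in.
        if (args.length > 0) {
            download(target, args[0]);
        }
    }
}
```

Whether the decoded URL is still reachable is a separate matter; the sketch only shows the mechanics.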

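I'd like to point out that this is a fragile solution: if the URL pattern changes, it will break or need a lot of maintenance. If you are targeting more than a small set of users, you should rethink your approach.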

HTML URL encoding table

Answer 2

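If you want a "quick and dirty" way to solve the problem, see @Christopher Schneider's answer. (But it is liable to break if DuckDuckGo's URL syntax changes ...)

I did some digging (using curl --trace-ascii, and so on). This is not a redirect problem. According to curl, the 500 is the immediate response to the request.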
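A similar check can be made from Java. The sketch below (class and method names are hypothetical) sends one GET with automatic redirect following switched off and prints the status code and any Location header, so a 3xx redirect can be told apart from an immediate 5xx. It does not reproduce the curl trace, and the response a server gives to Java's default request headers may differ from what curl or a browser gets.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class StatusProbe {

    /** Puts a status code into a broad category: redirect, error, or success. */
    static String describe(int status) {
        if (status >= 200 && status < 300) return "success";
        if (status >= 300 && status < 400) return "redirect";
        if (status >= 400 && status < 500) return "client error";
        if (status >= 500 && status < 600) return "server error";
        return "other";
    }

    /** Sends one GET without following redirects and prints what came back. */
    static void probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false);
        try {
            // getResponseCode() returns the code even for 4xx and 5xx;
            // it is getInputStream() that throws for those.
            int status = conn.getResponseCode();
            System.out.println(status + " (" + describe(status) + ")");
            System.out.println("Location: " + conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URL to inspect as the first argument.
        if (args.length > 0) {
            probe(args[0]);
        }
    }
}
```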

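So my best guess is that this behavior is "by design". The server is looking at the request headers (for example, the "User-Agent" header) and deciding that your request does not look like it comes from a supported browser. The 500 response is obfuscation, whether intentional or not.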

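Why?

Most likely, the people who run DuckDuckGo don't want you using that server endpoint for automated downloads, scraping, and so on. They are not entirely clear about this, but this link goes some way towards explaining it: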

https://duckduckgo.com/api

Solution?

Don't do it! See whether you can do what you want using their official API (see above). If that doesn't work, contact them.
