Save a file from a website with Java

I'm trying to build a jsoup-based Java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired by a similar Python-based app). It's supposed to ask you for the name of the film and then download an English subtitle for it from subscene.

I can get as far as the download link, but when I try to 'go' to that link I get an "Unhandled content type" error. Here's my code:

public static void main(String[] args) {
    try {
        String videoName = JOptionPane.showInputDialog("Title: ");
        subscene(videoName);
    }
    catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

public static void subscene(String videoName){
    try {
        String siteName = "http://www.subscene.com";
        String[] splits = videoName.split("\\s+");
        String codeName = "";
        String text = "";
        if(splits.length>1){
            for(int i=0;i<splits.length;i++){
                codeName = codeName+splits[i]+"-";
            }
            videoName = codeName.substring(0, videoName.length());
        }
        System.out.println("videoName is "+videoName);
        // String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
        String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
        System.out.println("url is "+url);
        Document doc = Jsoup.connect(url).get();
        Element exact = doc.select("h2.exact").first();
        Element yuel = exact.nextElementSibling();
        Elements lis = yuel.children();

        System.out.println(lis.first().children().text());
        String hRef = lis.select("div.title > a").attr("href");
        hRef = siteName+hRef+"/english";

        System.out.println("hRef is "+hRef);
        doc = Jsoup.connect(hRef).get();

        Element nonHI = doc.select("td.a40").first();
        Element papa = nonHI.parent();
        Element link = papa.select("a").first();
        text = link.text();
        System.out.println("Subtitle is "+text);
        hRef = link.attr("href");
        hRef = siteName+hRef;

        Document subDownloadPage = Jsoup.connect(hRef).get();
        hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
        Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
    }
    catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}

Can someone help me so that I don't have to download the subs manually?

I just found out that using

java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));

instead of

Jsoup.connect(hRef).get();

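prompts me to save and then downloads the file. But I don't want to be prompted, because then I can't read the name of the downloaded zip file (I want to unzip it with Java after saving).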

Here:

Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();

It looks like jsoup expects the result of Jsoup.connect(hRef) to be HTML or some other text it is able to parse, which is why the message states:

Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml

I ran your code manually, and the last URL you're trying to access returns the content type application/x-zip-compressed, hence the exception.
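To make the rule in that message concrete, here is a minimal sketch of such a check. It is not jsoup's actual implementation: ContentTypeCheck and isParseable are hypothetical names, and the accepted types are taken only from the error message quoted above.

```java
import java.util.Locale;

// Hypothetical helper, not part of jsoup: mirrors the rule quoted in the error message.
final class ContentTypeCheck {

    // Returns true when the Content-Type header value is one the message lists as acceptable.
    // Assumes a non-null header value.
    static boolean isParseable(String contentType) {
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/xml")
                || mime.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        // The HTML pages fetched earlier in the question pass this check...
        System.out.println(isParseable("text/html; charset=UTF-8"));
        // ...while the content type of the final download link does not.
        System.out.println(isParseable("application/x-zip-compressed"));
    }
}
```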

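To download this file, you should use a different approach. You can use the old but still useful URLConnection or URL, or a third-party library like Apache HttpComponents, to fire a GET request and retrieve the result as an InputStream, then copy that stream into a suitable output stream to write the file to disk.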

Here's an example of doing this using URL:

// needs java.io.* and java.net.URL
URL url = new URL(hRef);
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
// try-with-resources closes both streams even if the copy fails
try (InputStream in = new BufferedInputStream(url.openStream());
     OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"))) {
    int length;
    // read() returns -1 at the end of the stream
    while ( (length = in.read(buffer)) != -1 ) {
        out.write(buffer, 0, length);
    }
}
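
On Java 7 or later, the copy loop can also be left to java.nio.file.Files.copy. The following is only a sketch under that assumption: Downloader and save are hypothetical names, and the URL and target path are whatever you pass in (for the question, the final hRef and a path of your choosing, which also means you already know the file name when you want to unzip it).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: streams a response body straight to a file.
final class Downloader {

    // Copies everything from the stream to target, replacing an existing file.
    // Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens the URL and saves its body; try-with-resources closes the stream.
    static long save(String address, Path target) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return save(in, target);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Downloader <url> <target-file>");
            return;
        }
        System.out.println(save(args[0], Paths.get(args[1])) + " bytes written");
    }
}
```

Because the data is streamed rather than held in memory, this does not depend on the file being small.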

Assuming your file is small, you can do it like this. Note that you can tell Jsoup to ignore the content type.

// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();

// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();

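I recommend validating the content before saving, because some servers just return an HTML page when the file cannot be found, i.e. when the hyperlink is broken.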

...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
  throw new IllegalStateException("invalid file content");
}
...
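
If the download is expected to be a ZIP archive, as in the question, a stricter check is to look at the first bytes of the body: a non-empty ZIP archive normally starts with the local file header signature 0x50 0x4B 0x03 0x04 ("PK" followed by 3 and 4). A minimal sketch, with ZipCheck and looksLikeZip as hypothetical names:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a downloaded body starts like a ZIP archive.
final class ZipCheck {

    // Local file header signature that begins a non-empty ZIP archive: 'P' 'K' 0x03 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Returns true only when the body is at least four bytes long and starts with the signature.
    static boolean looksLikeZip(byte[] body) {
        if (body == null || body.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (body[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // An HTML error page, as a server might return for a broken link.
        byte[] html = "<html><body>Not found</body></html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeZip(html));
        // The first bytes of a typical ZIP file.
        System.out.println(looksLikeZip(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14}));
    }
}
```

With the Jsoup snippet above, the bytes to test would be resultImageResponse.bodyAsBytes(), checked before they are written to the file.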