HtmlUnit: saving pdf link

How can I download a PDF link from a website using HtmlUnit? By default, WebClient.getPage() returns an HtmlPage, which does not handle PDF files.

The answer is that WebClient.getPage() returns an UnexpectedPage when the response is not an HTML file. You can then read the PDF from the response as an input stream and save it.

private void grabPdf(String urlNow)
{
    OutputStream outStream =null;
    InputStream is = null;
    try
    {
        if(urlNow.endsWith(".pdf"))
        {
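            final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
            try
            {
                // setWebClientOptions is a helper of the enclosing class (not shown here)
                setWebClientOptions(webClient);
                // A PDF response comes back as an UnexpectedPage rather than an HtmlPage
                final UnexpectedPage pdfPage = webClient.getPage(urlNow);
                // Raw bytes of the response body
                is = pdfPage.getWebResponse().getContentAsStream();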

                String fileName = "myfilename";
                fileName = fileName.replaceAll("[^A-Za-z0-9]", "");

                File targetFile = new File(outputPath + File.separator + fileName  + ".pdf");
                outStream = new FileOutputStream(targetFile);
                byte[] buffer = new byte[8 * 1024];
                int bytesRead;
                while ((bytesRead = is.read(buffer)) != -1)
                {
                    outStream.write(buffer, 0, bytesRead);
                }


            }
            catch (Exception e)
            {
                NioLog.getLogger().error(e.getMessage(), e);
            }
            finally
            {
                webClient.close();
                if(null!=is)
                {
                    is.close();
                }
                if(null!=outStream)
                {
                    outStream.close();
                }
            }
        }
    }
    catch (Exception e)
    {
        NioLog.getLogger().error(e.getMessage(), e);
    }

}

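A side note: I did not use try-with-resources because the output stream can only be initialized inside the try block. I could have split this into two methods, but that would be cognitively slower for a programmer to follow.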
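For comparison, here is a minimal sketch of the copy step written with try-with-resources, which allows several resources to be declared in one try header. It uses only standard-library classes: StreamSaver and save are hypothetical names, and the ByteArrayInputStream in main is a stand-in for the stream returned by pdfPage.getWebResponse().getContentAsStream(). Treat it as one possible shape under those assumptions, not as the answer's method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class StreamSaver {

    // Copies everything from 'in' to 'target' and returns the number of bytes written.
    // Both streams are declared in the try header, so they are closed automatically
    // (in reverse order of declaration) even if the copy loop throws.
    static long save(InputStream in, Path target) throws IOException {
        long total = 0;
        try (InputStream source = in;
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8 * 1024];
            int bytesRead;
            while ((bytesRead = source.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
                total += bytesRead;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HtmlUnit response stream (assumption for this demo).
        byte[] fake = "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII);
        Path target = Files.createTempFile("demo", ".pdf");
        long written = save(new ByteArrayInputStream(fake), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

If the WebClient in your HtmlUnit version implements AutoCloseable, it could be declared in the same try header, which would remove the finally block altogether.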

private boolean grabPdf(String url, File output) {
    FileOutputStream outStream = null;
    InputStream is = null;
    try {
        final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        try {
            final UnexpectedPage pdfPage = webClient.getPage(url);
            is = pdfPage.getWebResponse().getContentAsStream();
            outStream = new FileOutputStream(output);
            byte[] buffer = new byte[8 * 1024];
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                outStream.write(buffer, 0, bytesRead);
            }
            return true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if(webClient != null)
                webClient.close();
            if(is != null)
                is.close();
            if(outStream != null)
                outStream.close();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return false;
}

I suggested this as an edit, but it was rejected. This answer improves on the original in the following ways:

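  • Returns a boolean indicating whether the download succeeded
  • Works for links that do not end in .pdf
  • Takes a File parameter for the destination instead of hard-coding it in the method
  • Changes FIREFOX to BEST_SUPPORTED, since it is the more general recommendation (though users may want to change it to suit their needs)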