HtmlUnit: saving a PDF link
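How can I download a PDF link from a website using HtmlUnit?
By default, WebClient.getPage() returns an HtmlPage, which does not handle PDF files.
The answer is that WebClient.getPage() returns an UnexpectedPage when the response is not HTML and has a content type HtmlUnit does not otherwise handle, such as a PDF. You can then read the PDF as an input stream and save it.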
// Note: setWebClientOptions(...), outputPath and NioLog are the author's own
// helpers and fields; they are not shown in this post.
private void grabPdf(String urlNow)
{
    OutputStream outStream = null;
    InputStream is = null;
    try
    {
        // Only handle links that point directly at a PDF.
        if (urlNow.endsWith(".pdf"))
        {
            final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
            try
            {
                setWebClientOptions(webClient);
                // A PDF response comes back as an UnexpectedPage, not an HtmlPage.
                final UnexpectedPage pdfPage = webClient.getPage(urlNow);
                is = pdfPage.getWebResponse().getContentAsStream();
                // Placeholder name; strip anything that is not a letter or digit.
                String fileName = "myfilename";
                fileName = fileName.replaceAll("[^A-Za-z0-9]", "");
                File targetFile = new File(outputPath + File.separator + fileName + ".pdf");
                outStream = new FileOutputStream(targetFile);
                // Copy the response body to disk in 8 KiB chunks.
                byte[] buffer = new byte[8 * 1024];
                int bytesRead;
                while ((bytesRead = is.read(buffer)) != -1)
                {
                    outStream.write(buffer, 0, bytesRead);
                }
            }
            catch (Exception e)
            {
                NioLog.getLogger().error(e.getMessage(), e);
            }
            finally
            {
                webClient.close();
                if (null != is)
                {
                    is.close();
                }
                if (null != outStream)
                {
                    outStream.close();
                }
            }
        }
    }
    catch (Exception e)
    {
        // Also catches IOExceptions thrown by close() in the finally block.
        NioLog.getLogger().error(e.getMessage(), e);
    }
}
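Side note: I did not use try-with-resources because the output stream can only be initialized inside the try block. I could have split this into two methods, but that would make it slower for a programmer to read and follow.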
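For readers who do want try-with-resources, one option is to move only the copy step into its own helper. The following is a minimal sketch that uses nothing but the JDK; `StreamSaver` and `save` are hypothetical names and are not part of the answer above or of HtmlUnit. In the real method, the input stream would be the one returned by `pdfPage.getWebResponse().getContentAsStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original answer or of HtmlUnit.
final class StreamSaver {

    // Copies everything from `in` to `target`, overwriting an existing file,
    // and returns the number of bytes written. The input stream is closed on
    // exit whether the copy succeeds or throws.
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; a real caller would pass the web response stream.
        byte[] fake = "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII);
        Path target = Files.createTempFile("grabbed", ".pdf");
        long written = save(new ByteArrayInputStream(fake), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

`Files.copy` opens and closes the output file itself, so there is no separate output stream left to manage in a finally block.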
// Downloads the resource at url into output; returns true on success.
private boolean grabPdf(String url, File output) {
    FileOutputStream outStream = null;
    InputStream is = null;
    try {
        final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        try {
            // Throws ClassCastException (caught below) if the response is an HTML page.
            final UnexpectedPage pdfPage = webClient.getPage(url);
            is = pdfPage.getWebResponse().getContentAsStream();
            outStream = new FileOutputStream(output);
            // Copy the response body to disk in 8 KiB chunks.
            byte[] buffer = new byte[8 * 1024];
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                outStream.write(buffer, 0, bytesRead);
            }
            return true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (webClient != null)
                webClient.close();
            if (is != null)
                is.close();
            if (outStream != null)
                outStream.close();
        }
    } catch (Exception e) {
        // Also catches IOExceptions thrown by close() in the finally block.
        e.printStackTrace();
    }
    return false;
}
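I suggested this as an edit, but it was rejected. This answer improves on the original in the following ways:
- It returns a boolean indicating whether the file was downloaded.
- It works for links that do not end in .pdf.
- It takes a File parameter for the destination instead of hard-coding it in the method.
- It changes FIREFOX to BEST_SUPPORTED, since that is the more general recommendation (though users may want to change it to suit their needs).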