How to dump an HTML document without relative URLs?
Thanks to stackoverflow.com, I have this:
Document doc = Jsoup.connect(urlFromUser).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").timeout(0).get();
doc.absUrl(urlFromUser);
doc.setBaseUri(urlFromUser);
Elements elements = doc.select("body");
Elements imgElements = doc.select("img");
for (Element element : imgElements) {
    element.attr("src", element.attr("abs:src"));
}
Elements hrefElements = doc.select("a");
for (Element element : hrefElements) {
    element.attr("href", "http://www.some.com/translit/lat2cyr?" + element.attr("abs:href"));
}
Elements linkElements = doc.head().select("link");
for (Element element : linkElements) {
    element.attr("href", element.attr("abs:href"));
    writer.print("");
    manipulateElements(elements);
}
The result is:
<link rel="stylesheet" href="css/windows/windows.css?">
But I need this:
<link rel="stylesheet" href="http://DOMAIN.com/css/windows/windows.css?">
I tried this, but it did not solve the problem:
String host = uri.getHost();
host = "http://" + host;
writer.print(doc.toString()
    .replaceAll("href=\"/css/", "href=\"" + host + "/css/")
    .replaceAll("/jscript/", host + "/jscript/")
    .replaceAll("/styles/", host + "/styles/")
    .replaceAll("/functions/", host + "/functions/")
    .replaceAll("href=\"/templates/", host + "/templates/")
    .replaceAll("href=\"/plugins/", host + "/plugins/")
    .replaceAll("href=\"css/", "href=\"" + host + "/css/"));
writer.close();
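To achieve your goal, you would need to customize OuterHtmlVisitor so that it emits absolute URLs instead of relative ones. Unfortunately, as of JSoup 1.8.3, this class is internal.
You could try writing a custom NodeVisitor implementation, but that is too much work.
On the other hand, here is a workaround: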
// Fetch the document. JSoup will set the baseUri for us automatically...
Document doc = Jsoup //
        .connect(urlFromUser) //
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36") //
        .timeout(0) //
        .get();
// Turn any url into an absolute url
String myTargetedTags = "img, a, link";
for (Element e : doc.select(myTargetedTags)) {
    switch (e.tagName().toLowerCase()) {
    case "img":
        e.attr("src", e.absUrl("src"));
        break;
    case "a":
        e.attr("href", "http://www.some.com/translit/lat2cyr?" + e.absUrl("href"));
        break;
    case "link":
        e.attr("href", e.absUrl("href"));
        break;
    default:
        throw new RuntimeException("Unexpected element:\n" + e.outerHtml());
    }
}
// Print out the final result
writer.print(doc.outerHtml());
writer.flush(); // Just to be sure that everything goes out...
writer.close();
Note: I do not know how this code performs on large documents.
Sample code
// Parse an in-memory document; the second argument is the base URI used to resolve relative URLs
Document doc = Jsoup
        .parse( //
                "<html><head><link rel=\"stylesheet\" type=\"text/css\" href=\"/css/main.css\"></head><body><img src=\"img/my-image.jpg\"><a href=\"/page/page.html\">an anchor</a></body></html>", //
                "http://localhost");
System.out.println("** BEFORE **\n" + doc.outerHtml());
// Turn any url into an absolute url
// (same lines as above...)
// Print out the final result
System.out.println("\n** AFTER **\n" + doc.outerHtml());
Output
** BEFORE **
<html>
<head>
<link rel="stylesheet" type="text/css" href="/css/main.css">
</head>
<body>
<img src="img/my-image.jpg">
<a href="/page/page.html">an anchor</a>
</body>
</html>
** AFTER **
<html>
<head>
<link rel="stylesheet" type="text/css" href="http://localhost/css/main.css">
</head>
<body>
<img src="http://localhost/img/my-image.jpg">
<a href="http://www.some.com/translit/lat2cyr?http://localhost/page/page.html">an anchor</a>
</body>
</html>
Tested on JSoup 1.8.3
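To make the resolution step itself easier to see, here is a minimal sketch that resolves relative references against a base URL using only java.net.URI from the standard library. It is a stand-in for illustration, not jsoup's own implementation of absUrl. The class name AbsoluteUrlSketch, the helper toAbsolute, and the host other.example are hypothetical, and the sketch assumes the base is a plain http(s) URL without a query or fragment.

```java
import java.net.URI;

final class AbsoluteUrlSketch {

    // Hypothetical helper (not part of jsoup): resolve a reference against a base URL.
    static String toAbsolute(String baseUri, String reference) {
        URI base = URI.create(baseUri);
        // java.net.URI can mis-resolve relative paths when the base path is empty,
        // so "http://host" is treated as "http://host/" here.
        // Assumption: the base has no query or fragment.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = URI.create(baseUri + "/");
        }
        // An already absolute reference is returned unchanged by resolve().
        return base.resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost";
        String[] references = {"/css/main.css", "img/my-image.jpg", "http://other.example/x"};
        for (String ref : references) {
            System.out.println(ref + " -> " + toAbsolute(base, ref));
        }
    }
}
```

Both root-relative paths such as /css/main.css and document-relative paths such as img/my-image.jpg end up under the base host, which is why the base URI has to be known before any attribute is rewritten.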