How to dump an HTML document without relative urls?

Thanks to stackoverflow.com, I have this:

Document doc = Jsoup.connect(urlFromUser).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").timeout(0).get();

doc.absUrl(urlFromUser);
doc.setBaseUri(urlFromUser);

Elements elements = doc.select("body");
Elements imgElements = doc.select("img");

for (Element element : imgElements) {
    element.attr("src", element.attr("abs:src"));
}

Elements hrefElements = doc.select("a");
for (Element element : hrefElements) {
    element.attr("href", "http://www.some.com/translit/lat2cyr?" + element.attr("abs:href"));
}

Elements linkElements = doc.head().select("link");
for (Element element : linkElements) {
    element.attr("href", element.attr("abs:href"));

    writer.print("");
    manipulateElements(elements);
}

The result is:

<link rel="stylesheet" href="css/windows/windows.css?">

But I need this:

<link rel="stylesheet" href="http://DOMAIN.com/css/windows/windows.css?">

I tried this, but it did not solve the problem:

String host = uri.getHost();
host = "http://" + host;

writer.print(doc.toString().replaceAll("href=\"/css/", "href=\"" + host + "/css/").replaceAll("/jscript/", host + "/jscript/").replaceAll("/styles/", host + "/styles/").replaceAll("/functions/", host + "/functions/").replaceAll("href=\"/templates/", host + "/templates/").replaceAll("href=\"/plugins/", host + "/plugins/").replaceAll("href=\"css/", "href=\"" + host + "/css/"));
writer.close();

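To achieve your goal, you would need to customize OuterHtmlVisitor so that it generates absolute URLs instead of relative ones. Unfortunately, as of JSoup 1.8.3, this class is internal.

You could try writing a custom NodeVisitor implementation, but that is too much work.

On the other hand, here is a workaround: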

// Fetch the document. JSoup will set the baseUri for us automatically...
Document doc = Jsoup //
    .connect(urlFromUser) //
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36") //
    .timeout(0) //
    .get();

// Turn any url into an absolute url
String myTargetedTags = "img, a, link";
for (Element e : doc.select(myTargetedTags)) {
    switch (e.tagName().toLowerCase()) {
        case "img":
            e.attr("src", e.absUrl("src"));
            break;

        case "a":
            e.attr("href", "http://www.some.com/translit/lat2cyr?" + e.absUrl("href"));
            break;

        case "link":
            e.attr("href", e.absUrl("href"));
            break;

        default:
            throw new RuntimeException("Unexpected element:\n" + e.outerHtml());
    }
}

// Print out the final result
writer.print(doc.outerHtml());
writer.flush(); // Just to be sure that everything goes out...
writer.close();

Note: I don't know how this code performs on large documents.
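
For readers wondering what "turning a URL into an absolute URL" means here: absUrl resolves the attribute value against the document's base URI. The sketch below illustrates that kind of resolution with the JDK's java.net.URI only. It is not JSoup's own implementation, and the class name AbsoluteUrlSketch and the method toAbsolute are hypothetical names chosen for this illustration. The base URI is assumed to end with a slash.

```java
import java.net.URI;

public class AbsoluteUrlSketch {

    // Resolve a possibly relative URL against a base URI.
    // Hypothetical helper for illustration; JSoup does its own resolution inside absUrl.
    static String toAbsolute(String baseUri, String url) {
        return URI.create(baseUri).resolve(url).toString();
    }

    public static void main(String[] args) {
        // Assumed base URI, written with a trailing slash.
        String base = "http://localhost/";

        // Root-relative path: replaces the path of the base.
        System.out.println(toAbsolute(base, "/css/main.css"));
        // prints http://localhost/css/main.css

        // Path-relative URL: appended to the directory of the base.
        System.out.println(toAbsolute(base, "img/my-image.jpg"));
        // prints http://localhost/img/my-image.jpg

        // Already absolute URL: returned unchanged.
        System.out.println(toAbsolute(base, "http://www.some.com/page.html"));
        // prints http://www.some.com/page.html
    }
}
```

The same three cases appear in the sample below: the link uses a root-relative path, the img uses a path-relative one, and both come out prefixed with the base URI.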

Sample code

// Parse the document from a string, passing the base URI explicitly...
Document doc = Jsoup
            .parse( //
               "<html><head><link rel=\"stylesheet\" type=\"text/css\" href=\"/css/main.css\"></head><body><img src=\"img/my-image.jpg\"><a href=\"/page/page.html\">an anchor</a></body></html>", //
               "http://localhost");
System.out.println("** BEFORE **\n" + doc.outerHtml());

// Turn any url into an absolute url
// (same lines as above...)

// Print out the final result
System.out.println("\n** AFTER **\n" + doc.outerHtml());

Output

** BEFORE **
<html>
 <head>
  <link rel="stylesheet" type="text/css" href="/css/main.css">
 </head>
 <body>
  <img src="img/my-image.jpg">
  <a href="/page/page.html">an anchor</a>
 </body>
</html>

** AFTER **
<html>
 <head>
  <link rel="stylesheet" type="text/css" href="http://localhost/css/main.css">
 </head>
 <body>
  <img src="http://localhost/img/my-image.jpg">
  <a href="http://www.some.com/translit/lat2cyr?http://localhost/page/page.html">an anchor</a>
 </body>
</html>

Tested on JSoup 1.8.3