Delete element from HTML by raw xpath using jsoup or any other library

Question

I am trying to delete an element from HTML using a raw XPath.

        final Document document = Jsoup.parse(htmlAsString);
        final Elements elements = document.select("/html/head");
        elements.forEach(Node::remove);

But I get the following error:

org.jsoup.select.Selector$SelectorParseException: Could not parse query '/html/head': unexpected token at '/html/head'
at org.jsoup.select.QueryParser.findElements(QueryParser.java:206)
at org.jsoup.select.QueryParser.parse(QueryParser.java:59)
at org.jsoup.select.QueryParser.parse(QueryParser.java:42)
at org.jsoup.select.Selector.select(Selector.java:91)
at org.jsoup.nodes.Element.select(Element.java:372)

Is there any way to use a raw XPath to get or delete elements from HTML?

Answer 1

jsoup natively supports a set of CSS selectors, not XPath. You can do this instead:

Document doc = Jsoup.parse(html);
doc.select("html > head").remove();

(See the Selector syntax and Elements#remove() documentation.)

If you specifically need XPath (why?), you can use jsoup's W3CDom converter to convert the jsoup document into a W3C document (Java XML) and run the XPath query on it:

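import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
...

// Parse with jsoup, then convert to a W3C DOM that the JDK's XPath engine can query
org.jsoup.nodes.Document jdoc = Jsoup.parse(html);
Document w3doc = W3CDom.convert(jdoc);

// Compile the raw XPath and evaluate it against the W3C document
String query = "/html/head";
XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(query);
Node head = (Node) xpath.evaluate(w3doc, XPathConstants.NODE);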
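
The snippet above only locates the node; it does not delete it. Below is a minimal sketch of the removal step on a W3C DOM, using only JDK classes. To keep it self-contained it parses a well-formed XHTML string with DocumentBuilder rather than going through jsoup, which is an assumption of this sketch and not what the question does. The class name XPathRemoveSketch and its helper methods are hypothetical names made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathRemoveSketch {

    // Parses well-formed XML/XHTML into a W3C document (not namespace aware).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Removes every node matched by the XPath query; returns the number removed.
    static int removeByXPath(Document doc, String query) {
        try {
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            int removed = 0;
            // Walk backwards so nested matches are detached before their ancestors.
            for (int i = matches.getLength() - 1; i >= 0; i--) {
                Node node = matches.item(i);
                Node parent = node.getParentNode();
                if (parent != null) { // attributes and the document itself have no parent
                    parent.removeChild(node);
                    removed++;
                }
            }
            return removed;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath: " + query, e);
        }
    }

    // Serializes the document back to a string, without the XML declaration.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>t</title></head><body><p>hi</p></body></html>");
        int removed = removeByXPath(doc, "/html/head");
        System.out.println(removed + " node(s) removed");
        System.out.println(serialize(doc));
    }
}
```

With jsoup in the picture you would presumably pass the converted w3doc to a method like removeByXPath instead of calling parse. Keep in mind that the W3C document is a converted copy, so removing a node there does not change the original jsoup Document; you would work with (or serialize) the W3C document afterwards.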

java

xpath

html-parsing

jsoup

spring-boot