jsoup outputs wrong HTML when < exists inside text

Question

The input HTML is:

<p>猫<虎</p>

which Chrome displays as 猫<虎

But when I parse the HTML with jsoup, the output HTML is:

<p>猫
  <虎 < p>
  </虎<></p>

How can I solve this problem without changing < to &lt; in the input?

Answer 1

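Why do you think jsoup is "wrong" and Chrome is "right"? A < that is not part of a tag should always be escaped as &lt; (otherwise it will be interpreted as opening a tag). Fix that, and all standards-compliant HTML tools will agree on the same parsing. Don't fix it, and some may disagree. In this case, JSoup is accepting non-alphanumerics as a tag name, which is invalid, but then it was handed an unescaped < as part of a tag name!

If you insist on not changing the source HTML, you can simply pre-process it before feeding it to JSoup: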
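
To illustrate escaping at the source, here is a minimal sketch of escaping text before it is placed inside an element. The names `TextEscaper` and `escapeText` are hypothetical, and only the three characters that matter in text content are handled.

```java
final class TextEscaper {
    /** Escapes characters that would otherwise be read as markup in element text. */
    static String escapeText(String text) {
        // '&' must be replaced first so the entities added below are not escaped again
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String html = "<p>" + escapeText("猫<虎") + "</p>";
        System.out.println(html); // <p>猫&lt;虎</p>
    }
}
```

With the text escaped this way, both a browser and jsoup should read the paragraph content as 猫<虎.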

 // before 
 Document doc = Jsoup.parse(html);

 // with pre-processing
 Document doc = Jsoup.parse(fixOutOfTagLessThan(html));

where

 /**
  * Replaces not-in-tag `<` by `&lt;`, but WILL FAIL in 
  * many cases, because it is unaware of:
  * - comments (<!--)
  * - javascript
  * - the fact that you should NOT PARSE HTML WITH REGEX
  */
 public static String fixOutOfTagLessThan(String html) {
    // $1 keeps the text between the stray '<' and the next '<'
    return html.replaceAll("<([^</>]+)<", "&lt;$1<");
 }

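Chrome appears to be applying HTML5 parsing logic to treat the < as text (since it is not part of a valid tag name). However, as I understand it, it should reject everything up to the >, and then emit the missing </p>. So, in my view, it does not seem to be fully following the standard either.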
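
The "treat < as text" behaviour can be approximated in a pre-processing step. The sketch below assumes the rule is that a < only starts markup when the next character is an ASCII letter, `/`, `!` or `?`, and escapes every other < with a regex lookahead. The names `LessThanEscaper` and `escapeStrayLessThan` are hypothetical, and like any regex approach it knows nothing about comments or script contents.

```java
import java.util.regex.Pattern;

final class LessThanEscaper {
    // A '<' NOT followed by an ASCII letter, '/', '!' or '?' cannot start a tag,
    // an end tag, a comment or a doctype, so it is treated as literal text.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    static String escapeStrayLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayLessThan("<p>猫<虎</p>")); // <p>猫&lt;虎</p>
        System.out.println(escapeStrayLessThan("<p>1 < 2</p>")); // <p>1 &lt; 2</p>
    }
}
```

The result can then be passed to `Jsoup.parse(...)` as in the pre-processing example above. Text such as `a<b` still looks like the start of a tag and is left alone, so this is an illustration of the rule, not a robust fix.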

html-parser

jsoup