HTML Purifier: disable syntax repair

Question

Consider the following HTML Purifier setup:

require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.EscapeInvalidTags', true);
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

If you run it on the following input:

$dirty_html = "<p>lorem <script>ipsum</script></p>";

//output
<p>lorem &lt;script&gt;ipsum&lt;/script&gt;</p>

As expected, instead of removing the invalid tags, it escaped them all.

However, consider these other test cases:

Case 1

$dirty_html = "<p>lorem <b>ipsum</p>";

//output
<p>lorem <b>ipsum</b></p>

//desired output
<p>lorem &lt;b&gt;ipsum</p>

Case 2

$dirty_html = "<p>lorem ipsum</b></p>";

//output
<p>lorem ipsum</p>

//desired output
<p>lorem ipsum&lt;/b&gt;</p>

Case 3

$dirty_html = "<p>lorem ipsum<script></script></p>";

//output
<p>lorem ipsum&lt;script /&gt;</p>

//desired output
<p>lorem ipsum&lt;script&gt;&lt;/script&gt;</p>

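Instead of just escaping the invalid tags, it first repairs them and then escapes them. That can produce very strange results, for example:

Case 4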

$dirty_html = "<p><a href='...'><div>Text</div></a></p>";

//output
<p><a href="..."></a></p><div><a href="...">Text</a></div><a href="..."></a>&lt;/p&gt;

Question
So, is it possible to disable the syntax repair and only escape the invalid tags?

Answer 1

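The reason you are seeing syntax repair is the fundamental way HTML Purifier approaches HTML sanitization: it first parses the HTML to understand it, then decides which elements to keep in the parsed representation, and then renders HTML from that representation.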

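You may be familiar with one of Stack Overflow's most famous answers, an amused and exasperated observation that true regular expressions cannot parse HTML. You need additional logic, because HTML is a context-free language, not a regular one. (Modern 'regular' expressions are not formal regular expressions, but that is another matter.) In other words, if you really want to know what is going on in your HTML, so that you can apply your white- or blacklisting correctly, you need to parse it, which means the text ends up in a completely different representation.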

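One example of how parsing causes changes between input and output is that HTML Purifier strips extraneous whitespace from between attributes. That may not bother you in your case, but it stems from the same fact: the parsed representation is quite different from the text representation. HTML Purifier is not trying to preserve the form of your input; it is trying to preserve its function.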
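To make the parse-then-render effect concrete, here is a minimal sketch that round-trips Case 1 through a parser. It assumes PHP's bundled DOM extension as a stand-in and does not use HTML Purifier's own pipeline; the `roundtrip` helper is a hypothetical name introduced only for illustration.

```php
<?php
// Assumption: PHP's DOM extension (libxml) stands in for "a parser" here.
// This is not HTML Purifier's pipeline, only an illustration of the principle.

// Hypothetical helper: parse a fragment into a tree, then serialize it again.
function roundtrip(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml's complaints about malformed markup instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return trim($doc->saveHTML());
}

$input  = "<p>lorem <b>ipsum</p>";
$output = roundtrip($input);

// The tree has no notion of an "unclosed" element, so the serialized
// markup differs from the input: the <b> element comes back closed.
echo $output, PHP_EOL;
```

Once the input is a tree, the information that `</b>` was missing is no longer there to escape, which is why escaping happens after repair rather than instead of it.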

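This gets tricky when there is no clear function and it has to start guessing. As an example, imagine going through the HTML input and coming across what looks like an opening <td> tag in the middle of nowhere. You can consider it valid if there was an unclosed <td> tag a while back, as long as you add a closing tag. But if you had escaped that first tag as &lt;td&gt;, you would need to discard the text data that would have been in the <td>, because, depending on browser rendering, it might put data into parts of the page outside the fragment, i.e. places that were not clearly user-submitted.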

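In short: you cannot easily disable all syntax repair and/or tidying without digging through HTML Purifier's parsing internals and making sure that no information you consider valuable gets lost.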

That said, you can try switching the underlying parsing engine with Core.LexerImpl and see if it gets you better results! :) DOMLex definitely adds missing ending nodes right from the get-go, but from a cursory glance, DirectLex may not. There is also a large chunk of autoclosing logic in HTMLPurifier's MakeWellFormed strategy class, which might cause problems for you as well.
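As a sketch of the lexer switch: this extends the setup from the question with the `Core.LexerImpl` directive set to `DirectLex`. Whether this avoids the auto-closing in any of the four cases is untested here, and the MakeWellFormed strategy may still repair the markup after lexing.

```php
<?php
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.EscapeInvalidTags', true);
// Use the pure-PHP lexer instead of the DOM-based default.
$config->set('Core.LexerImpl', 'DirectLex');

$purifier = new HTMLPurifier($config);

// Case 1 from the question; compare the result with the DOMLex output.
echo $purifier->purify("<p>lorem <b>ipsum</p>"), PHP_EOL;
```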

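Depending on why you want to preserve this data, though (to allow analysis?), saving the original input separately, while leaving HTML Purifier itself alone, might give you a better solution.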
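A minimal sketch of that last suggestion, assuming you keep the raw and the purified markup side by side. The function name `make_submission` and the array keys are hypothetical; the purifier is passed in as a callable so that HTML Purifier stays untouched.

```php
<?php
// Hypothetical helper: keep the untouched input next to the sanitized markup.
// $purify is any callable that maps dirty HTML to clean HTML,
// for example [$purifier, 'purify'] with an HTMLPurifier instance.
function make_submission(string $dirtyHtml, callable $purify): array
{
    return [
        'raw'   => $dirtyHtml,          // for analysis only, never echo this unescaped
        'clean' => $purify($dirtyHtml), // safe to render
    ];
}

// Stand-in sanitizer so the sketch runs without the library installed.
$record = make_submission("<p>lorem ipsum</b></p>", 'strip_tags');
```

With the real library, the call would be `make_submission($dirty_html, [$purifier, 'purify'])`. Only `clean` should ever reach a page; `raw` would need escaping with something like `htmlspecialchars` before being displayed.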
