Why do these two DOMDocument functions behave differently?

Question

There are two approaches to getting the outer HTML of a DOMDocument node suggested here: How to return outer html of DOMDocument?

I'm interested in why they seem to treat HTML entities differently.

Example:

// Approach 2: import the node into a fresh document and serialize that whole document.
function outerHTML($node) {
    $doc = new DOMDocument();
    $doc->appendChild($doc->importNode($node, true));
    return $doc->saveHTML();
}

$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is 0.</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$el = $dom->getElementsByTagName('p')->item(0);
// Approach 1: pass the node to saveHTML() on its owner document.
echo $el->ownerDocument->saveHTML($el) . PHP_EOL;
echo outerHTML($el) . PHP_EOL;

Output:

<p>ACME’s 27” Monitor is 0.</p>
<p>ACME&rsquo;s 27&rdquo; Monitor is 0.</p>

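Both methods use saveHTML(), but for some reason the outerHTML() function preserves the HTML entities in its final output, while calling saveHTML() directly with a node context does not. Can anyone explain why, preferably with some kind of authoritative reference?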

Answer 1

This can be reduced to an even simpler test case than the one above:

<?php
$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is 0.</p>';
$dom = new DOMDocument();
// The two flags stop libxml2 from adding a doctype and html/body wrappers.
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom->documentElement) . PHP_EOL; // single node
echo $dom->saveHTML() . PHP_EOL;                      // whole document

So the question becomes: why does DOMDocument::saveHTML behave differently when saving an entire document rather than just a specific node?

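Looking at the PHP source code, we find a check for whether it is working with a single node or a whole document. In the former case, the htmlNodeDumpFormatOutput function is called with the encoding explicitly set to null. In the latter case, the htmlDocDumpMemoryFormat function is used, and this function does not take the encoding as an argument.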

Both of these functions come from the libxml2 library. Looking at that source, we can see that htmlDocDumpMemoryFormat tries to detect the document encoding and, if it cannot find one, explicitly sets it to ASCII/HTML.
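The encoding detection can be probed with a small sketch. It assumes, based on the description above rather than on anything verified for every libxml2 version, that a Content-Type meta tag in the document is what the detection picks up. The markup and variable names are illustrative only, and no particular output for the whole-document call is claimed.

```php
<?php
// Assumption: a declared charset in a Content-Type meta tag is what the
// whole-document serializer detects; results may vary by libxml2 version.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body><p>ACME&rsquo;s 27&rdquo; Monitor</p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

// Whole-document path: the encoding is looked up from the document itself.
$whole = $dom->saveHTML();

// Single-node path: no encoding is passed along.
$node = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));

echo $whole . PHP_EOL;
echo $node . PHP_EOL;
```

Comparing how the paragraph appears in the two outputs, with and without the meta tag, shows whether the declared encoding changes the whole-document result on a given installation.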

Both functions end up calling htmlNodeListDumpOutput, passing it the encoding that was determined: either null, which results in no encoding, or ASCII/HTML, which encodes using HTML entities.
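If the difference really is only a matter of output encoding, the two results should carry the same content once entities are decoded. The following is a minimal sketch of that check, reusing the markup from the question. The variable names are illustrative, and html_entity_decode() is used here only as a way to compare the two strings, not as part of either serialization path.

```php
<?php
// Same markup as in the question.
$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is 0.</p>';

$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Single-node path (encoding passed as null).
$nodeOutput = trim($dom->saveHTML($dom->documentElement));

// Whole-document path (encoding detected, or falling back to ASCII/HTML).
$docOutput = trim($dom->saveHTML());

// Decode any entities in the whole-document output back to UTF-8 characters.
$decodedDocOutput = html_entity_decode($docOutput, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($nodeOutput === $docOutput);        // raw comparison of the two paths
var_dump($nodeOutput === $decodedDocOutput); // comparison after decoding entities
```

The second comparison is the relevant one: it checks that the two paths differ only in how non-ASCII characters are written out, not in the content they serialize.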

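My guess is that, for a document fragment or a single node, the encoding is considered less important than it is for a full document.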

html

php

domdocument

html-entities