How to extract HTML markup within an XML node with XPath

I am using DOMDocument and XPath.

Given the following XML:

<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>

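The node /Description/CompleteText/DetailTxt/Text contains markup that is unfortunately not escaped, and I cannot change that. Is there any way to query its content while preserving the HTML markup?

What I tried

nodeValue, obviously, and also textContent. Both give me the content with the markup stripped.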
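
To make the problem concrete, here is a minimal sketch of what both properties return. It assumes a shortened version of the XML above (the one-line document and the /Text path are made up for illustration, not the real input):

```php
<?php
// Shortened, hypothetical stand-in for the <Text> node from the question.
$doc = new DOMDocument();
$doc->loadXML('<Text><span>Here there is some text</span><h2>And maybe a headline</h2><br/></Text>');

$xpath = new DOMXPath($doc);
$node = $xpath->query('/Text')->item(0);

// textContent concatenates the text of all descendants and drops the tags.
echo $node->textContent, "\n";
// For an element node, nodeValue yields the same tag-free string.
echo $node->nodeValue, "\n";
```

Both lines print the text of the span and the headline run together, with no trace of the span, h2, or br tags.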

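You can use the saveHTML method of DOMDocument to serialize a node as HTML. In your case it looks like you want to call it on each child node of the selected node and concatenate the resulting strings. In the browser DOM APIs that would be called innerHTML, so I wrote a function of the same name to do this. The snippet below also uses the ability to call PHP functions from XPath: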

<?php
$xml = <<<'EOD'
<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>  
EOD;

$doc = new DOMDocument();

$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

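// Called from XPath via php:function(); a node-set argument arrives as an array of DOM nodes.
function innerHTML($nodeList) {
  $node = $nodeList[0];
  $html = '';
  $containingDoc = $node->ownerDocument;
  // Serialize each child as HTML and concatenate, which leaves out the <Text> wrapper itself.
  foreach ($node->childNodes as $child) {
    $html .= $containingDoc->saveHTML($child);
  }
  return $html;
}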

$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("innerHTML");



$innerHTML = $xpath->evaluate('php:function("innerHTML", /Description/CompleteText/DetailTxt/Text)');

echo $innerHTML;

Output (http://sandbox.onlinephpfunctions.com/code/62a980e2d2a2485c2648e16fc647a6bd6ff5620b):

            <span>Here there is some text</span>
            <h2>And maybe a headline</h2>
            <br>
            <span>Normal position</span>
            <br>
            <span> </span>
            <br>
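
If registering a PHP function with XPath feels like more machinery than needed, the same serialization can probably be done by calling a helper directly on the node that query() returns. This is only a sketch: innerHtmlOf is a hypothetical name, and the sample document and the /Description/Text path are shortened for illustration.

```php
<?php
// Hypothetical helper (not from the answer above): serialize the children of one node as HTML.
function innerHtmlOf(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() accepts a node and serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadXML('<Description><Text><span>Some text</span><br/><h2>A headline</h2></Text></Description>');

$xpath = new DOMXPath($doc);
$textNode = $xpath->query('/Description/Text')->item(0);

// item(0) returns null when the expression matches nothing, so guard before serializing.
$html = $textNode !== null ? innerHtmlOf($textNode) : '';
echo $html, "\n";
```

The trade-off is that the extraction no longer happens inside the XPath expression itself, but there is no namespace or function registration to set up.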

I found that the C14N method of DOMNode works quite well.

http://sandbox.onlinephpfunctions.com/code/90dc915c9a43c91d31fcd47d37e89df430951b2e

<?php
$xml = <<<'EOD'
<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>  
EOD;

$doc = new DOMDocument();

$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Not used in this variant; left over from the previous snippet.
function innerHTML($nodeList) {
  $node = $nodeList[0];
  $html = '';
  $containingDoc = $node->ownerDocument;
  foreach ($node->childNodes as $child) {
    $html .= $containingDoc->saveHTML($child);
  }
  return $html;
}

// Also not needed here, since no php:function() call is made.
$xpath->registerNamespace("php", "http://php.net/xpath");


$domNodes = $xpath->query('/Description/CompleteText/DetailTxt/Text');
$domNode = $domNodes[0];
$innerHTML = $domNode->C14N();

echo $innerHTML;

Result

<Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br></br>
                <span>Normal position</span>
                <br></br>
                <span> </span>
                <br></br>
            </Text>

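It seems somewhat shorter in a way; what do you think? I still need to get rid of the enclosing <Text> node, though. Thanks also for pointing me to the PHP Sandbox.

Update

I now see that C14N() changes the markup: <br/> becomes <br></br>. Canonical XML always writes an empty element as a start tag and end tag pair.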
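
One possible way around both issues (the enclosing <Text> node and the expanded br tags) is to call saveXML on each child instead of C14N on the parent. This was not tried in the thread; innerXml is a hypothetical helper name and the sample document is shortened for illustration.

```php
<?php
// Hypothetical helper: concatenate the XML serialization of each child node.
function innerXml(DOMNode $node): string
{
    $xml = '';
    foreach ($node->childNodes as $child) {
        // saveXML() with a node argument serializes only that subtree,
        // and by default writes empty elements in the short <br/> form.
        $xml .= $node->ownerDocument->saveXML($child);
    }
    return $xml;
}

$doc = new DOMDocument();
$doc->loadXML('<Text><span>Normal position</span><br/><span> </span></Text>');

$inner = innerXml($doc->documentElement);
echo $inner, "\n";
// prints: <span>Normal position</span><br/><span> </span>
```

Unlike saveHTML, this keeps the markup as XML, so whether <br/> or <br> is preferable depends on where the string ends up.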