How to extract the text between a tag and some end tag using XPath
I have the following HTML. The class names are always the same. Only the text between the tags varies, in both length and content.
<a>
<span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text <q class="aaa">this not</q></span>
</a>
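How can I extract the content from the tag with class "zzz" to the end of the line, without including the element with class "aaa"? Is that possible?
The element with class "aaa" may or may not be present: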
<a>
<span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text</span>
</a>
The expected result is:
This is the required text
The "the required text" part may also be present or absent:
<a>
<span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span></span>
</a>
In that case the result should be:
This is
I tried this in PHP with DOMXPath.
I don't know how to do this with XPath, but here is a way to do it without XPath.
// Recursively yield a node and all of its descendants in document order.
function walk(DOMNode $node, $skipParent = false) {
    if (!$skipParent) {
        yield $node;
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $n) {
            yield from walk($n);
        }
    }
}

$html = <<<'HTML'
<span class="xxx">
Not this text
<span class="yyy">not this text</span>
<span class="zzz">This is</span>
the required text
<q class="aaa">this not</q>
</span>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$content = null; // stays null if no "zzz" element is found
$count = 0;      // nodes visited since the "zzz" element was found
foreach (walk($dom->firstChild) as $node) {
    if ($node instanceof DOMElement && $node->getAttribute('class') === 'xxx') {
        foreach (walk($node) as $n) {
            if (isset($content)) {
                $count++;
            }
            if ($n instanceof DOMElement && $n->getAttribute('class') === 'zzz') {
                $content = $n->textContent;
            }
            // $count is 1 for the text inside "zzz" and 2 for the node right after it
            if (isset($content) && $n instanceof DOMText && $count == 2) {
                $content .= " " . $n->textContent;
                break 2;
            }
        }
    }
}
var_dump($content);
This gives you the desired result whether or not the "the required text" part is present.
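The same idea can also be sketched more compactly by walking the siblings of the "zzz" element instead of counting visited nodes. This is only an illustration, not part of the code above: the function name `textAfterZzz` is made up, and it assumes the "zzz" span and the trailing text share the same parent, as in the question's markup.

```php
<?php
// Hypothetical helper (name is an assumption): returns the text of the first
// <span class="zzz"> plus any text nodes that follow it inside the same parent.
function textAfterZzz(string $html): ?string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html); // the parser adds <html><body> wrappers, which is harmless here

    foreach ($dom->getElementsByTagName('span') as $span) {
        if ($span->getAttribute('class') !== 'zzz') {
            continue;
        }
        $parts = [$span->textContent];
        // Collect only text siblings; elements such as <q class="aaa"> are skipped.
        for ($n = $span->nextSibling; $n !== null; $n = $n->nextSibling) {
            if ($n instanceof DOMText) {
                $parts[] = $n->textContent;
            }
        }
        // Collapse runs of whitespace and trim the ends.
        return trim(preg_replace('/\s+/', ' ', implode('', $parts)));
    }
    return null; // no "zzz" element found
}

$html = '<a><span class="xxx">Not this text <span class="yyy">not this text</span> '
      . '<span class="zzz">This is</span> the required text <q class="aaa">this not</q></span></a>';
echo textAfterZzz($html), PHP_EOL; // prints "This is the required text"
```

Because only `DOMText` siblings are collected, the result does not depend on how many nodes sit between "zzz" and the trailing text.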
XPath solution:
$xml = <<<'XML'
<a><span class="xxx">Not this text <span class="yyy">not this text</span> <span class="zzz">This is</span> the required text</span></a>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Non-blank text nodes that come after the "yyy" span and are not inside an "aaa" element
$elements = $xpath->query('//text()[parent::*[not(@class="aaa")]][preceding::span[@class="yyy"]][normalize-space()]');
foreach ($elements as $element) {
    echo $element->nodeValue;
}
Output:
This is the required text
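The expression above relies on a "yyy" span appearing earlier in the document. As a possible alternative, the query can be anchored on the "zzz" span itself. The sketch below is not the answer's original expression, and the helper name `extractFromZzz` is made up. It assumes the trailing text is a direct sibling of the "zzz" span, and it runs the query against all three variants from the question.

```php
<?php
// Hypothetical helper (name is an assumption): text of the "zzz" span plus
// the text nodes that follow it as siblings, with outer whitespace trimmed.
function extractFromZzz(string $xml): string
{
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);

    // Left branch: text inside the "zzz" span.
    // Right branch: text nodes that are following siblings of that span.
    // Text inside <q class="aaa"> is a child of <q>, not a sibling text node,
    // so it is never selected.
    $nodes = $xpath->query(
        '//span[@class="zzz"]//text() | //span[@class="zzz"]/following-sibling::text()'
    );

    $text = '';
    foreach ($nodes as $node) { // a union is returned in document order
        $text .= $node->nodeValue;
    }
    return trim($text);
}

$prefix = '<a><span class="xxx">Not this text <span class="yyy">not this text</span> ';
$variants = [
    $prefix . '<span class="zzz">This is</span> the required text <q class="aaa">this not</q></span></a>',
    $prefix . '<span class="zzz">This is</span> the required text</span></a>',
    $prefix . '<span class="zzz">This is</span></span></a>',
];

foreach ($variants as $xml) {
    echo extractFromZzz($xml), PHP_EOL;
}
// prints:
// This is the required text
// This is the required text
// This is
```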