Regular Expression To Match Header Tags Not In Specific Div

Question

I have PHP code that outputs HTML that looks like this:

<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>

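What I want to do is preg_match_all the header tags. My regex (<h([1-6]{1})[^>]*)>.*<\/h> returns all of them fine, but I don't want the header inside the div with class "ignore". I have been reading about negative lookaheads, but that is getting tricky. Any help would be much appreciated.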

Desired output:

<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>

Note that the "I'm one in here too" header is omitted because it is wrapped in the div with class "ignore".

Answer 1

Don't mess around with regex here. Unleash the power of DOMDocument combined with XPath queries:

<?php
$html = <<<EOT
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
EOT;

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$headers = $xpath->query("
    //div[not(contains(@class, 'ignore'))]
    /*[self::h2 or self::h4 or self::h5]");

foreach ($headers as $header) {
    echo $header->nodeValue . "\n";
}

?>

This yields

This is a header
This is one too
Here's one

Answer 2

With DOMDocument and DOMXPath:

$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$nodeList = $xp->query('
//*
[contains(";h1;h2;h3;h4;h5;h6;", concat(";", local-name(), ";"))]
[not(ancestor::div[
    contains(concat(" ", normalize-space(@class), " "), " ignore ")
    ])
]');

foreach ($nodeList as $node) {
    echo 'tag name: ', $node->nodeName, PHP_EOL,
         'html content: ', $dom->saveHTML($node), PHP_EOL,
         'text content: ', $node->textContent, PHP_EOL,
         PHP_EOL;
}


If you are not comfortable with XPath, take a look at the zvon tutorial.

Answer 3

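Since you specified that you want to do this with preg_match(), here is an example with a negative lookbehind (i.e. match only those occurrences that are not preceded by XYZ): https://regex101.com/r/FeAsuj/1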

The lookbehind itself is (?<!<div class=\"ignore\">).

But as the test snippet shows, be aware:

The regex depends on the exact use of whitespace...
...so a platform-dependent \r\n can break the regex.
A lookbehind cannot have variable length, e.g. \n? - see "Regular Expression Lookbehind doesn't work with quantifiers ('+' or '*')".
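
The line-ending caveat can be reproduced with a small sketch. The function name findHeaders and the shortened sample input are made up for illustration, and the pattern is a simplified stand-in for the one in the regex101 link, not a copy of it:

```php
<?php
// findHeaders() is a hypothetical helper for this illustration only.
function findHeaders(string $html): array
{
    // Fixed-length negative lookbehind: skip a header that directly follows
    // the literal text '<div class="ignore">' plus exactly one "\n".
    $pattern = '~(?<!<div class="ignore">\n)<h([1-6])[^>]*>.*?</h\1>~';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

$lines = [
    '<div class="wrapper">',
    '<h2>This is a header</h2>',
    '<h4>Here\'s one</h4>',
    '<div class="ignore">',
    '<h5>I\'m one in here too</h5>',
    '</div>',
    '</div>',
];

// Same markup, two different line-ending conventions.
$unix    = findHeaders(implode("\n", $lines));
$windows = findHeaders(implode("\r\n", $lines));

echo count($unix), " headers with \\n line endings\n";
echo count($windows), " headers with \\r\\n line endings\n";
```

With \n line endings the h5 is skipped. With \r\n the text before it no longer ends in exactly <div class="ignore">\n, so the lookbehind no longer blocks the match and the h5 is returned as well.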

If you must stick with regex, consider a two-step approach:

Step 1: use preg_replace() to remove all the unwanted parts.
Step 2: apply your existing regex.
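
A minimal sketch of the two steps, under some assumptions: the ignored div contains no nested div, the class attribute is written exactly as class="ignore", and the header pattern is a variant of the one in the question that uses a \1 backreference for the closing tag. Treat it as an illustration rather than a robust solution:

```php
<?php
$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

// Step 1: drop every <div class="ignore">...</div> block.
// Assumption: no <div> is nested inside the ignored one; otherwise the
// lazy .*? would stop at the first inner </div> and leave debris behind.
$stripped = preg_replace('~<div\s+class="ignore"[^>]*>.*?</div>~s', '', $html);

// Step 2: collect the remaining headers. \1 ties the closing tag to the
// heading level captured in the opening tag.
preg_match_all('~<h([1-6])[^>]*>.*?</h\1>~s', $stripped, $matches);

print_r($matches[0]);
```

On the sample markup this leaves the two h2 headers and the h4. Nested divs or a different attribute order would defeat step 1, which is the usual argument for an HTML parser.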

Overall, I agree with the other posters: avoid regex here and use an HTML parser.
