Finding a regexp pattern not preceded by something

Question

I have the following HTML file structure:

<table>
   <tr class="heading">
      <td colspan="2">
         <h2 class="groupheader">Public Types</h2> 
         <!-- I don't want that! We're in a table.-->
      </td>
   </tr>
   <tr>...</tr> 
</table>
<h2 class="groupheader">Detailed Description</h2>
  <!-- I want all that until the next h2-->
  <div class="textblock"><p>Provides the functions to control the generation of a single data log file. </p>
    <h4>Example</h4>
    <div class="fragment"><div class="line">Test <a href="aaa">stuff</a>();</div>
        <div class="line">...</div>     
        <div class="line">...</div>
    </div>
</div> <!-- end of first result -->

<h2 class="groupheader">Member</h2>
<!-- I want all that until the next h2 or hr-->
<a class="anchor"></a>
<div class="memitem">
<div class="memproto">
      <table class="memname">
        <tr>
          <td class="memname">enum <a class="el" href="...">test</a></td>
        </tr>
      </table>
</div><div class="memdoc">
<hr><!-- End of 2nd result -->

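Using a regex, I need to get all the content between each title and the next title or hr tag, except when the title is inside a table.

So far, I have managed to get all the h2 -> h2|hr content. It looks like this: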

(?s)(<h2 class="groupheader">.*?)(<h2|<hr)

How can I skip the content under an H2 that is contained in a table? I tried negative lookarounds, but got nowhere.

Thanks for your help.

Answer 1

Note that HTML should be parsed with a proper parser.
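As a rough illustration of the parser route, here is a minimal sketch using Python's standard `html.parser` module. The class name `SectionCollector`, its attributes, and the trimmed sample input are all hypothetical, and the sketch collects only the text of each section (not its markup), which may or may not be what the asker needs.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collect the text following each <h2 class="groupheader"> outside a table."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0      # > 0 while the parser is inside a <table>
        self.in_heading = False   # True while inside a qualifying <h2>
        self.current = None       # title of the section being collected
        self.sections = {}        # title -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif self.table_depth == 0 and tag in ("h2", "hr"):
            # A top-level <h2> or <hr> ends the section in progress.
            self.current = None
            if tag == "h2" and ("class", "groupheader") in attrs:
                self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.table_depth -= 1
        elif tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_heading:
            # The heading text becomes the key of a new section.
            self.current = text
            self.sections[text] = []
        elif self.current is not None and text:
            self.sections[self.current].append(text)


# Trimmed, hypothetical version of the question's sample.
html = """<table><tr><td><h2 class="groupheader">Public Types</h2> skipped</td></tr></table>
<h2 class="groupheader">Detailed Description</h2>
<div class="textblock"><p>Provides the functions.</p></div>
<h2 class="groupheader">Member</h2>
<table class="memname"><tr><td>enum test</td></tr></table>
<hr>
<p>after the rule</p>"""

collector = SectionCollector()
collector.feed(html)
collector.close()
print(collector.sections)
```

Tracking the table depth is what keeps the "Public Types" heading out of the result: headings and rules are only acted on while the depth is zero.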

Now, since all we have is input that looks like HTML and a task

to get all the content between each title till the next title or hr tag, except if it's in a table

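let me show how it can be done.

You can get the required substrings with the help of a tempered greedy token, ((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*). It matches any symbol that does not start one of the alternatives in the negative lookahead, so a match can never run past a closing </table> tag, and it also consumes inner tables whole. A positive lookahead at the end then requires the next h2 or hr: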

(?s)<h2 class="groupheader">[^<]*<\/h2>\s*((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)(?=<h2|<hr)
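To see how the pattern above behaves, here is a small Python sketch that applies it to a trimmed version of the question's sample. The names `PATTERN`, `HTML` and `sections` are illustrative, and the shortened input is an assumption made to keep the example readable.

```python
import re

# The answer's pattern, split over three lines for readability.
PATTERN = re.compile(
    r'(?s)<h2 class="groupheader">[^<]*</h2>\s*'
    r'((?:(?!</table|<h2|<hr)(?:<table\b[^<]*>.*?</table>|.))*)'
    r'(?=<h2|<hr)'
)

# Trimmed, hypothetical version of the HTML from the question.
HTML = """<table>
   <tr class="heading">
      <td colspan="2">
         <h2 class="groupheader">Public Types</h2>
      </td>
   </tr>
</table>
<h2 class="groupheader">Detailed Description</h2>
<div class="textblock"><p>Provides the functions.</p></div>
<h2 class="groupheader">Member</h2>
<div class="memproto">
  <table class="memname"><tr><td>enum test</td></tr></table>
</div>
<hr>"""

# Group 1 holds the content between a heading and the next <h2> or <hr>.
sections = [m.group(1) for m in PATTERN.finditer(HTML)]
for section in sections:
    print(repr(section))
```

On this input the heading inside the table produces no match, because the tempered token stops at `</table` and the final lookahead then fails. The two top-level headings each yield one captured section, and the inner `memname` table is kept inside the second one.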

See the demo.

Note that you can use h\d+ instead of h2 to support any heading level.
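One possible way to apply that substitution is sketched below in Python. The backreference `\1` for the closing tag and the names `ANY_HEADING`, `html` and `sections` are assumptions of this sketch, not part of the original answer. Capturing the tag name shifts the content to group 2.

```python
import re

# Variant of the pattern that accepts any heading level.
# Group 1 is the heading tag name (h2, h3, ...), group 2 is the content.
ANY_HEADING = re.compile(
    r'(?s)<(h\d+) class="groupheader">[^<]*</\1>\s*'
    r'((?:(?!</table|<h\d|<hr)(?:<table\b[^<]*>.*?</table>|.))*)'
    r'(?=<h\d|<hr)'
)

# Small hypothetical input mixing two heading levels.
html = (
    '<h3 class="groupheader">Notes</h3>\n'
    '<p>first</p>\n'
    '<h2 class="groupheader">Member</h2>\n'
    '<p>second</p>\n'
    '<hr>'
)

sections = [m.group(2) for m in ANY_HEADING.finditer(html)]
print(sections)
```

Be aware that with this variant any heading tag ends a section, so a nested heading such as the `<h4>Example</h4>` in the question's sample would cut the "Detailed Description" result short.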

html

regex

tags

negative-lookbehind