Finding sibling items between headers

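I'm trying to scrape some documentation for data files written in XML. I've been writing the XSD by hand, reading the pages and typing it out, but it strikes me that this is a prime case for page scraping. The general format (as far as I can tell, based on a random sample) looks something like the following: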

   <h2>
    <span class="mw-headline" id="The_.22show.22_Child_Element">
     The "show" Child Element
    </span>
   </h2>
   <dl>
    <dd>
     <table class="infotable">
      <tr>
       <td class="leftnormal">
        allowstack
       </td>
       <td>
        (Optional) Boolean – Indicates whether the user is allowed to stack items within the table, subject to the restrictions imposed for each item. Default: "yes".
       </td>
      </tr>
      <tr>
       <td>
        agentlist
       </td>
       <td>
        (Optional) Id – If set to a tag group id, only picks with the agent pick's identity tag from that group are shown in the table. Default: None.
       </td>
      </tr>
      <tr>
       <td>
        allowmove
       </td>
       <td>
        (Optional) Boolean – Indicates whether picks in this table can be moved out of this table, if the user drags them around. Default: "yes".
       </td>
      </tr>
      <tr>
       <td>
        listpick
       </td>
       <td>
        (Optional) Id – Unique id of the pick to take the table's list expression from (see listfield, below). Note that this does not work when used with portals. Default: None.
       </td>
      </tr>
      <tr>
       <td>
        listfield
       </td>
       <td>
        (Optional) Id – Unique id of the field to take the table's list expression from (see listpick, above). Note that this does not work when used with portals. Default: None.
       </td>
      </tr>
     </table>
    </dd>
   </dl>
   <p>
    The "show" element also possesses child elements that define additional behaviors of the table. The list of these child elements is below and must appear in the order shown. Click on the link to access the details for each element.
   </p>
   <dl>
    <dd>
     <table class="infotable">
      <tr>
       <td class="leftnormal">
        <a href="index.php5@title=TableDef_Element_(Data).html#list">
         list
        </a>
       </td>
       <td>
        An optional "list" element may appear as defined by the given link. This element defines a
        <a href="index.php5@title=List_Tag_Expression.html" title="List Tag Expression">
         List Tag Expression
        </a>
        for the table.
       </td>
      </tr>
     </table>
    </dd>
   </dl>

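Every file follows a very clear pattern: a number of elements, each introduced by a header, then some text, then a table (usually the attributes), and possibly another block of text and a table (for the child elements). I think I can get to a reasonable solution by simply using next or next-sibling to step through the items and scanning the text to decide whether the following table holds attributes or child elements, but it feels a bit odd that I can't just grab everything between two header tags and then scan that.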

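You can search for several elements at once, for example <h2> and <table>. Passing a list to find_all() returns the matches in document order, so you can make a note of the contents of each <h2> before processing each <table> that follows it.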

For example:

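from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")

h2 = h2_id = None  # no header seen yet
# A list of names matches either tag; results come back in document order
for el in soup.find_all(['h2', 'table']):
    if el.name == 'h2':
        # Remember the current header so the tables below it can be labelled
        h2 = el.get_text(strip=True)
        h2_id = el.span.get('id') if el.span else None
    else:
        for tr in el.find_all('tr'):
            row = [td.get_text(strip=True) for td in tr.find_all('td')]
            print([h2, h2_id, *row])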
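
If you really do want everything between two headers collected in one place, something along these lines might work. It is only a sketch: it assumes the <h2> elements and the content after them are siblings under the same parent, as in the sample above. The html string is a made-up, trimmed-down page, and sections_between_headers is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page in the same shape as the sample above.
html = """
<h2><span class="mw-headline" id="show">The "show" Child Element</span></h2>
<dl><dd><table class="infotable">
<tr><td>allowstack</td><td>(Optional) Boolean</td></tr>
</table></dd></dl>
<p>The "show" element also possesses child elements.</p>
<dl><dd><table class="infotable">
<tr><td><a href="#list">list</a></td><td>An optional "list" element.</td></tr>
</table></dd></dl>
<h2><span class="mw-headline" id="other">Another Element</span></h2>
<p>No tables here.</p>
"""


def sections_between_headers(soup, header="h2"):
    """Map each header's text to the sibling tags that follow it,
    stopping at the next header with the same tag name."""
    sections = {}
    for h in soup.find_all(header):
        block = []
        # With no filter, find_next_siblings() yields only Tag siblings,
        # so whitespace text nodes between elements are skipped.
        for sib in h.find_next_siblings():
            if sib.name == header:
                break
            block.append(sib)
        sections[h.get_text(strip=True)] = block
    return sections


soup = BeautifulSoup(html, "html.parser")
sections = sections_between_headers(soup)

for title, block in sections.items():
    # Tables are nested inside <dl><dd> here, so search within each sibling.
    tables = [t for tag in block for t in tag.find_all("table")]
    print(title, [tag.name for tag in block], len(tables))
```

Each value is a plain list of tags, so you can scan the <p> text in a block to decide what the next table describes. If the headers sit at different nesting levels on some pages, the sibling walk will not see across them, and the find_all(['h2', 'table']) approach is probably the safer choice.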