How to get all elements of nested tags in BeautifulSoup?

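I'm new to HTML and parsing, so I apologize if I use the wrong terminology here. I asked a similar question before and got some useful answers. I have the following HTML snippet, which consists of two tables, each with its own header row (there are more rows, but they are not relevant to this post):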

<body>
    <table>
        <tr class="header">
            <th><strong>Heading 1</strong></th>
            <th><strong>Heading 2</strong></th>
            <th><strong>Heading 3</strong></th>
            <th><p><strong>Heading 4, line 1</strong></p>
            <p><strong>Heading 4, line 2</strong></p></th>
        </tr>

        <tr>
            <!--Many more rows-->
        </tr>
    </table>

    <table>
        <tr class="header">
            <th><strong>Diff Header 1</strong></th>
            <th><strong>Diff Header 2</strong></th>
            <th><strong>Diff Header 3</strong></th>
            <th><p><strong>Diff Header 4, line 1</strong></p>
            <p><strong>Diff Header 4, line 2</strong></p></th>
        </tr>
    
        <tr>
            <!--Many more rows-->
        </tr>
    </table>
</body>

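I'm trying to parse this with Python 3.6 and BeautifulSoup4 and extract the text into lists. My problem is that I want a separate list for each table. My current code seems to find every <th> tag in the document rather than only the ones in the first table.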

Here is what I have:

def parse_html(self):
    """ Parse the html file """
    with open(self.html_path) as f:
        soup = BeautifulSoup(f, 'html.parser')

    tables = soup.find_all('table')
    
    for table in tables:
        # Find each row in the table
        rows = table.find_all_next('tr')
        for row in rows:
            # Find each column in the row
            cols = row.find_all_next('th')
            for col in cols:
                # Print each cell
                print(col) # This is where it seems to be finding every <th>

            break          # Break just to do the first row (seems not to work?)

Question: how can I modify this code so that it only finds the <th> tags in the current row rather than in every row?

Thanks for your help!

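Use .find_all instead of .find_all_next. .find_all searches only the descendants of the element it is called on, whereas .find_all_next matches every element that appears after that element in the document, which is why the <th> tags of the later rows and tables were showing up.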
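To make the difference concrete, here is a minimal sketch. The two-table html_doc below is a shortened, hypothetical stand-in for the snippet in the question, and the variable names inside and after are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table document, shortened from the question's snippet.
html_doc = """
<table><tr><th>A1</th><th>A2</th></tr></table>
<table><tr><th>B1</th><th>B2</th></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_row = soup.find("tr")  # the only row of the first table

# find_all looks only at the descendants of first_row.
inside = [th.get_text() for th in first_row.find_all("th")]

# find_all_next walks everything that comes after first_row in the
# document, so it also reaches the <th> tags of the second table.
after = [th.get_text() for th in first_row.find_all_next("th")]

print(inside)  # ['A1', 'A2']
print(after)   # ['A1', 'A2', 'B1', 'B2']
```

This also explains why the break in the question did not seem to help: the very first row's find_all_next call already returns the headers of both tables.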

If html_doc is the HTML snippet from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

tables = soup.find_all("table")

for table in tables:
    # Find each row in the table
    rows = table.find_all("tr")
    for row in rows:
        cols = row.find_all("th")
        for col in cols:
            print(col)

    print("-" * 80)

Prints:

<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
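
Since the question asks for the text in a separate list per table rather than the printed tags, the same loop could collect the header text instead. This is only a sketch: the helper name header_texts is made up for illustration, the html_doc is a shortened version of the question's snippet, and it assumes the header row is the one marked class="header" as in the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the HTML from the question.
html_doc = """
<table>
  <tr class="header">
    <th><strong>Heading 1</strong></th>
    <th><p><strong>Heading 2, line 1</strong></p>
    <p><strong>Heading 2, line 2</strong></p></th>
  </tr>
</table>
<table>
  <tr class="header">
    <th><strong>Diff Header 1</strong></th>
    <th><strong>Diff Header 2</strong></th>
  </tr>
</table>
"""


def header_texts(html):
    """Return one list of header strings per <table> (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        # Assumes the header row carries class="header", as in the question.
        header_row = table.find("tr", class_="header")
        if header_row is None:
            result.append([])
            continue
        # get_text(" ", strip=True) joins the text of nested tags with a
        # space, so a multi-line header cell becomes a single string.
        result.append(
            [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
        )
    return result


headers = header_texts(html_doc)
for table_headers in headers:
    print(table_headers)
```

Each inner list then corresponds to one table, so headers[0] holds the first table's headings and headers[1] the second's.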