How to get all elements of nested tags in BeautifulSoup?

Question

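I'm new to HTML and parsing, so apologies if I use the wrong terminology. I asked a similar question before and got some useful answers. I have the following HTML snippet, which consists of two tables, each with a header row (there are more rows, but they are not relevant to this post):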

<body>
    <table>
        <tr class="header">
            <th><strong>Heading 1</strong></th>
            <th><strong>Heading 2</strong></th>
            <th><strong>Heading 3</strong></th>
            <th><p><strong>Heading 4, line 1</strong></p>
            <p><strong>Heading 4, line 2</strong></p></th>
        </tr>

        <tr>
            <!--Many more rows-->
        </tr>
    </table>

    <table>
        <tr class="header">
            <th><strong>Diff Header 1</strong></th>
            <th><strong>Diff Header 2</strong></th>
            <th><strong>Diff Header 3</strong></th>
            <th><p><strong>Diff Header 4, line 1</strong></p>
            <p><strong>Diff Header 4, line 2</strong></p></th>
        </tr>
    
        <tr>
            <!--Many more rows-->
        </tr>
    </table>
</body>

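I'm trying to parse this with Python 3.6 and BeautifulSoup 4 and extract the text into lists. My problem is that I want a separate list for each table. My current code seems to search for and find every <th> tag in the document, rather than only the ones in the first table.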

Here is what I have:

def parse_html(self):
    """ Parse the html file """
    with open(self.html_path) as f:
        soup = BeautifulSoup(f, 'html.parser')

    tables = soup.find_all('table')
    
    for table in tables:
        # Find each row in the table
        rows = table.find_all_next('tr')
        for row in rows:
            # Find each column in the row
            cols = row.find_all_next('th')
            for col in cols:
                # Print each cell
                print(col) # This is where it seems to be finding every <th>

            break          # Break just to do the first row (seems not to work?)

Question: how can I modify this code so that it finds only the <th> tags in the current row, rather than in every row?

Thanks for your help!

Answer 1

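Use .find_all instead of .find_all_next. .find_all searches only the descendants of the tag it is called on, whereas .find_all_next matches every tag that comes after that tag in the document, including tags in later rows and later tables. That is why the first row already returns every <th>, even with the break in place.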
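To illustrate the difference between the two methods, here is a minimal sketch. The two-table document in html_doc is a made-up, much smaller stand-in for the HTML in the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table document, much smaller than the one in the question.
html_doc = """
<table><tr><th>A1</th><th>A2</th></tr></table>
<table><tr><th>B1</th><th>B2</th></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_table = soup.find("table")

# find_all: only <th> tags nested inside first_table.
inside = [th.get_text() for th in first_table.find_all("th")]

# find_all_next: every <th> that appears after the start of first_table
# in the document, including those in later tables.
after = [th.get_text() for th in first_table.find_all_next("th")]

print(inside)  # ['A1', 'A2']
print(after)   # ['A1', 'A2', 'B1', 'B2']
```

The second list contains the headers of both tables, which matches the behaviour described in the question.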

If html_doc is the HTML snippet from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

tables = soup.find_all("table")

for table in tables:
    # Find each row in the table
    rows = table.find_all("tr")
    for row in rows:
        cols = row.find_all("th")
        for col in cols:
            print(col)

    print("-" * 80)

Prints:

<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
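The question also asks for the header text in a separate list per table. One possible way to do that is sketched below. The helper name headers_per_table is hypothetical, the HTML is a shortened version of the snippet from the question, and the sketch assumes the header row is always marked with class="header" as it is there.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question.
html_doc = """
<table>
  <tr class="header">
    <th><strong>Heading 1</strong></th>
    <th><p><strong>Heading 4, line 1</strong></p>
    <p><strong>Heading 4, line 2</strong></p></th>
  </tr>
</table>
<table>
  <tr class="header">
    <th><strong>Diff Header 1</strong></th>
  </tr>
</table>
"""


def headers_per_table(html):
    """Return one list of header strings for each <table> in html."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        # Assumption: the header row carries class="header".
        header_row = table.find("tr", class_="header")
        if header_row is None:
            result.append([])
            continue
        # get_text(" ", strip=True) joins the text of nested tags with a
        # single space and drops whitespace-only strings.
        result.append(
            [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
        )
    return result


print(headers_per_table(html_doc))
# [['Heading 1', 'Heading 4, line 1 Heading 4, line 2'], ['Diff Header 1']]
```

Using find and find_all on each table keeps the search scoped to that table, so the lists never mix headers from different tables.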

