How to get all elements of nested tags in BeautifulSoup?
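I'm new to HTML and parsing, so apologies if I use the wrong terminology. I asked a similar question before and got some useful answers. I have the following HTML snippet, made up of two tables, each with its own header row (there are more rows, but they are not relevant to this post):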
<body>
<table>
<tr class="header">
<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
</tr>
<tr>
<!--Many more rows-->
</tr>
</table>
<table>
<tr class="header">
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
</tr>
<tr>
<!--Many more rows-->
</tr>
</table>
</body>
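I'm trying to parse this with Python 3.6 and BeautifulSoup4 and extract the text into lists. My problem is that I want a separate list for each block. My current code seems to search for and find every <th> tag, rather than only those in the first table.
Here is what I have: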
def parse_html(self):
    """ Parse the html file """
    with open(self.html_path) as f:
        soup = BeautifulSoup(f, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        # Find each row in the table
        rows = table.find_all_next('tr')
        for row in rows:
            # Find each column in the row
            cols = row.find_all_next('th')
            for col in cols:
                # Print each cell
                print(col)  # This is where it seems to be finding every <th>
            break  # Break just to do the first row (seems not to work?)
Question: how can I modify this code so that it only finds the <th> tags in the current row, rather than in every row?
Thanks for your help!
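Use .find_all instead of .find_all_next. .find_all only searches the descendants of the tag it is called on, whereas .find_all_next searches everything that follows that tag in the document, which is why cells from later rows and tables were showing up.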
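To make the difference between the two methods concrete, here is a minimal sketch. The two-table document and the names `first_row`, `inside` and `after` are made up for illustration and are not taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical document, smaller than the one in the question.
html_doc = """
<table><tr><th>A1</th><th>A2</th></tr></table>
<table><tr><th>B1</th><th>B2</th></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_row = soup.find("tr")

# find_all only looks at descendants of first_row.
inside = [th.get_text() for th in first_row.find_all("th")]

# find_all_next walks everything that comes after the start of first_row
# in document order, so it also reaches cells in the second table.
after = [th.get_text() for th in first_row.find_all_next("th")]

print(inside)  # ['A1', 'A2']
print(after)   # ['A1', 'A2', 'B1', 'B2']
```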
If html_doc is the HTML snippet from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
tables = soup.find_all("table")
for table in tables:
    # Find each row in the table
    rows = table.find_all("tr")
    for row in rows:
        # Find only the header cells inside this row
        cols = row.find_all("th")
        for col in cols:
            print(col)
    # Separator printed once per table
    print("-" * 80)
Prints:
<th><strong>Heading 1</strong></th>
<th><strong>Heading 2</strong></th>
<th><strong>Heading 3</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
<th><strong>Diff Header 1</strong></th>
<th><strong>Diff Header 2</strong></th>
<th><strong>Diff Header 3</strong></th>
<th><p><strong>Diff Header 4, line 1</strong></p>
<p><strong>Diff Header 4, line 2</strong></p></th>
--------------------------------------------------------------------------------
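Since the question asks for the text in a separate list per table rather than the printed tags, the same idea could be sketched as below. The helper name `header_texts` and the shortened sample document are assumptions for illustration. `get_text(" ", strip=True)` is used so that a cell containing two <p> lines collapses into a single string.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the HTML from the question.
html_doc = """
<table>
<tr class="header">
<th><strong>Heading 1</strong></th>
<th><p><strong>Heading 4, line 1</strong></p>
<p><strong>Heading 4, line 2</strong></p></th>
</tr>
</table>
<table>
<tr class="header">
<th><strong>Diff Header 1</strong></th>
</tr>
</table>
"""


def header_texts(html):
    """Return one list of header-cell strings per <table>."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        # find_all stays inside the current table.
        cells = table.find_all("th")
        # Join the text pieces of each cell with a space and drop
        # whitespace-only pieces such as the newline between the <p> tags.
        result.append([th.get_text(" ", strip=True) for th in cells])
    return result


headers = header_texts(html_doc)
print(headers)
# [['Heading 1', 'Heading 4, line 1 Heading 4, line 2'], ['Diff Header 1']]
```

Note that `table.find_all("th")` would also pick up header cells of any table nested inside another table. The HTML in the question has no nested tables, so this sketch does not handle that case.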