BeautifulSoup: exclude a tag in findAll

Question

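In BeautifulSoup, how can we exclude tags that sit inside a particular tag when using findAll?

Consider this example: I want to find all <p> tags in the HTML, except the <p> tags inside a <tr> tag.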

soup.findAll(['p'])

The code above returns all <p> tags, but I need to exclude the <p> tags that are inside a <tr> tag.

Answer 1

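If I understand correctly, you want to select every <p> that has no <tr> ancestor at any level.

You can select all <p> tags and then filter the result with findParent. findParent returns the nearest ancestor with the given tag name, or None if there is none.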

from bs4 import BeautifulSoup

html = """
  <tr>
    <p>1</p>
  </tr>
  
  <tr>
    <td>
      <p>2</p>
    </td>
  </tr>
  
  <p>3</p>
  
  <div>
    <p>4</p>
  </div>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only the <p> tags that have no <tr> ancestor at any level.
print([p for p in soup.findAll('p') if not p.findParent('tr')])
# [<p>3</p>, <p>4</p>]
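
As a variation on the filter above, the same condition can be handed to the search itself, since find_all (the current bs4 name for findAll) also accepts a function that receives each tag. This is only a sketch under the same assumptions as the answer: the sample HTML is a shortened copy of the one above, and the variable name outside_tr is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used in the answer above.
html = """
<tr><p>1</p></tr>
<tr><td><p>2</p></td></tr>
<p>3</p>
<div><p>4</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The function is called once per tag; a tag is kept when it is a <p>
# and find_parent('tr') finds no <tr> ancestor (returns None).
outside_tr = soup.find_all(
    lambda tag: tag.name == "p" and tag.find_parent("tr") is None
)

print(outside_tr)
```

Whether this reads better than the list comprehension is a matter of taste; both walk every <p> and check its ancestors.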

Answer 2

You can use .select with a CSS selector. Examples:
Select all <p> tags, excluding <p> tags that are direct children of a <tr> tag:

soup.select('p:not(tr > p)')

Select all <p> tags, excluding <p> tags that are descendants of a <tr> tag at any depth:

soup.select('p:not(tr p)')

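Select all <p> and <h2> tags, excluding <p> tags that are descendants of a <tr> tag. The :not() has to be attached to p, because in a comma-separated list it only applies to the selector it follows:

soup.select('p:not(tr p), h2')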
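
To see how these selectors differ in practice, here is a small self-contained sketch. It reuses the sample HTML from Answer 1 with an <h2> added; the helper texts() and the variable names are hypothetical, introduced only for this comparison.

```python
from bs4 import BeautifulSoup

# Sample from Answer 1, plus one <h2> for the last two selectors.
html = """
<tr><p>1</p></tr>
<tr><td><p>2</p></td></tr>
<p>3</p>
<h2>Heading</h2>
<div><p>4</p></div>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


# Excludes only <p> that are direct children of <tr>: keeps 2, 3, 4.
direct = texts("p:not(tr > p)")

# Excludes <p> nested anywhere under <tr>: keeps 3, 4.
nested = texts("p:not(tr p)")

# :not() on the p part: keeps 3, 4 and the <h2>.
with_h2 = texts("p:not(tr p), h2")

# :not() on the h2 part filters nothing, so every <p> is returned too.
misplaced = texts("p, h2:not(tr p)")

print(direct, nested, with_h2, misplaced)
```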
