How to understand "recursive" with BeautifulSoup in Python

Question

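I am working on a Python project that uses the "recursive" option of BeautifulSoup. I have read the official documentation and asked many questions, but I still do not understand it.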

from bs4 import BeautifulSoup
s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, 'html.parser')

Is it because we cannot find anything outside of <div></div>?

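If my reading of the first case is correct, I would guess that this gives B, because we cannot go any deeper. But why does it start from <strong>? Why not ?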

Also, how can I extract B?

Answer 1

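When you use recursive=False, only the direct children of the element on which you call .find() or .find_all() are searched. The only direct child of the top-level soup object is the <div> element. Because it is not a <p> element, it does not match the given name, so nothing is found.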

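In the second example, you first use a recursive search to find the <p> element. You then call .find() without a name, so it matches any element name. Because you specified recursive=False, it only considers the direct children of <p>. The first child element is <strong>A</strong>, and that is what is returned.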
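The two calls being discussed are not shown in the question as it appears here, so the sketch below reconstructs them from the descriptions in the answers. Treat the exact calls (and the variable names first, second, and third) as assumptions. It also adds a third call that starts the non-recursive search from <div> for comparison.

```python
from bs4 import BeautifulSoup

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, "html.parser")

# Assumed first case: only the direct children of `soup` are checked,
# and the only one is <div>, so no <p> is found.
first = soup.find("p", recursive=False)
print(first)  # None

# Assumed second case: find <p> recursively, then take the first
# direct child tag of <p>, whatever its name is.
second = soup.find("p").find(recursive=False)
print(second)  # <strong>A</strong>

# For comparison: <p> is a direct child of <div>, so a non-recursive
# search that starts from <div> does find it.
third = soup.div.find("p", recursive=False)
print(third)  # <p><strong>A</strong>B</p>
```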

Answer 2

recursive=False returns only the direct children of the tag you are searching from. For example:

<li>
    <p>1</p>
    <p>2</p>
    <div>
      <p>3</p>
    </div>
</li>

li = soup.find('li')

Now,

print(li.findChildren("p"))

prints [<p>1</p>, <p>2</p>, <p>3</p>]

print(li.findChildren("p", recursive=False))

prints [<p>1</p>, <p>2</p>]
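The snippet above assumes a soup object that was already built from the <li> markup. A self-contained sketch of the same comparison might look like the following. It uses find_all (findChildren is an older alias for it) and prints only the text of each match to keep the output short; the variable names html, all_p, and direct_p are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<li>
    <p>1</p>
    <p>2</p>
    <div>
      <p>3</p>
    </div>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# Default (recursive=True): every <p> nested anywhere below <li>.
all_p = li.find_all("p")

# recursive=False: only <p> tags that are direct children of <li>;
# the <p> inside the nested <div> is skipped.
direct_p = li.find_all("p", recursive=False)

print([p.get_text() for p in all_p])     # ['1', '2', '3']
print([p.get_text() for p in direct_p])  # ['1', '2']
```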

To get B from <div>C<p><strong>A</strong>B</p></div>:

from bs4 import BeautifulSoup
s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, 'html.parser')
soup.strong.decompose()
print(soup.p)

prints <p>B</p> (soup.p.get_text() gives the bare string B)

Explanation:

print(soup.strong)

prints <strong>A</strong>

soup.strong.decompose()

removes <strong>A</strong> from the tree (Beautiful Soup decompose())

print(soup.p)

prints <p>B</p>
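decompose() modifies the tree. If the original document should stay intact, one possible alternative (a sketch, not the approach from this answer) is to ask <p> for its own direct text node, or to take the node that follows <strong>. The variable names b and b_alt are illustrative.

```python
from bs4 import BeautifulSoup

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, "html.parser")

# string=True matches text nodes; recursive=False restricts the search
# to direct children of <p>, so the "A" inside <strong> is not considered.
b = soup.p.find(string=True, recursive=False)
print(b)  # B

# Equivalent here: the node right after <strong> inside <p>.
b_alt = soup.strong.next_sibling
print(b_alt)  # B

# Nothing was removed from the tree.
print(soup)  # <div>C<p><strong>A</strong>B</p></div>
```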

Answer 3

HTML documents are nested: tags contain other tags. In the document you provided (s), the structure looks like this:

div
   `text node C`
   p
     strong
        `text node A`
     `text node B`

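The recursive argument tells BeautifulSoup whether to descend into the children of each node it examines when looking for a match. If it is set to False, only the direct children of the object you call find on are examined, and nothing nested below them.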
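To see which nodes each kind of search would visit, the following sketch lists the tag names reachable through .children (what recursive=False looks at) and through .descendants (what the default recursive search looks at). The helper tag_names is a hypothetical name introduced only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, "html.parser")


def tag_names(nodes):
    """Return the names of the Tag nodes in `nodes`, skipping text nodes."""
    return [node.name for node in nodes if isinstance(node, Tag)]


# Direct children only: this is the set a recursive=False search examines.
print(tag_names(soup.children))     # ['div']

# All nested nodes: this is the set a default (recursive) search examines.
print(tag_names(soup.descendants))  # ['div', 'p', 'strong']

# Direct children of <p>.
print(tag_names(soup.p.children))   # ['strong']
```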

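In the first case, there is only one root node (div). Because you told BeautifulSoup not to search recursively, it does not look at the children of div, so it returns None: there is no root-level 'p' element.
The second case is actually two 'find' calls chained together. The first 'find' looks for 'p', and does so recursively because recursive defaults to True. It finds 'div>p', as expected. You then call 'find' again on the result of the first call, and because you did not specify what kind of node to look for, it searches for any tag. The first child of 'p' is the 'strong' tag, so that is what is returned.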
