Non-recursive find with lxml builder
I've found in Python 2.7 that I can't do a non-recursive bs4.BeautifulSoup.find_all if I use the lxml builder.
Take the following example HTML snippet:
<p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>
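Say I want to find all p elements that are direct children. I do a non-recursive find_all with find_all("p", recursive=False).
To test this, I stored the HTML snippet above in a variable named html. Then I created two BeautifulSoup instances, a and b: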
a = bs4.BeautifulSoup(html, "html.parser")
b = bs4.BeautifulSoup(html, "lxml")
Both behave correctly when find_all is used normally:
>>> a.find_all("p")
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]
>>> b.find_all("p")
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]
But if I turn off recursive finding, only a works. b returns an empty list:
>>> a.find_all("p", recursive=False)
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>]
>>> b.find_all("p", recursive=False)
[]
Why is that? Is this a bug, or am I doing something wrong? Does the lxml builder support non-recursive find_all?
This is because the lxml parser wraps your HTML in html/body if those elements are not already present:
>>> b = bs4.BeautifulSoup(html, "lxml")
>>> print(b)
<html><body><p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>
</body></html>
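Because of this, find_all() in non-recursive mode only looks at the top level of the parsed document, where the sole element is html, which in turn has only a body child. The p elements sit one level further down, so you need to start the search from body: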
>>> print(b.find_all("p", recursive=False))
[]
>>> print(b.body.find_all("p", recursive=False))
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>]
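If the same code has to work with more than one parser, one option is to pick the search root at run time. The following is only a minimal sketch: the helper names top_level_find_all and top_level_bold_text are made up for illustration and are not part of bs4, and it assumes the fragment's top-level elements end up either directly under the soup object (html.parser) or under body (lxml).

```python
import bs4

HTML = """<p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>"""


def top_level_find_all(soup, name):
    """Hypothetical helper: non-recursive find_all that tolerates added wrappers."""
    # If the parsed tree has a <body> (added by lxml, or present in the
    # original markup), search its direct children; otherwise search the
    # direct children of the soup object itself.
    root = soup.body if soup.body is not None else soup
    return root.find_all(name, recursive=False)


def top_level_bold_text(markup, parser="html.parser"):
    """Hypothetical helper: text of the <b> tag in each top-level <p>."""
    soup = bs4.BeautifulSoup(markup, parser)
    return [p.b.get_text(strip=True) for p in top_level_find_all(soup, "p")]
```

Calling top_level_bold_text(HTML) with "html.parser" should skip the Penguins paragraph because it is nested inside the div; passing "lxml" as the parser is expected to give the same result, since the helper then starts from body.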