Extract all <p> from HTML with BeautifulSoup

Question

I haven't found a solution on Stack Overflow yet. My HTML snippet is:

<d1>
<dt class="abc">Test</dt><dd><dl>
    <dt>Part1</dt><dd><p>THISISWHATINEED<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
    <dt>Part2</dt><dd><p>THISISWHATINEED2<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
<dt class="abc">Test2</dt><dd><dl>
    <dt>Part3</dt><dd><p>THISISWHATINEED3<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
    <dt>Part4</dt><dd><p>THISISWHATINEED4<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>

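So how do I get all the <p> elements that belong to <dt class="abc">Test</dt><dd><dl>? I tried d1.find_all("dt"), but then I miss the <p> elements. I really don't know how to get the "children". The best approach would be to loop over the <dt> elements and then loop over the <p> elements inside each one, for example "Test" (the first part). But how do I do that? Do you have any suggestions or ideas?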

What I have already tried:

        d1 = soup.find_all("dl")
        for child in d1.children:
            print(child)

And many other things I can't recall off the top of my head.

Another approach that worked reasonably well:

            for child in d1.children:
                if child.string is not None:
                    continue
                if child.string is None:
                    xx= len(child.find_all("p"))

Thanks!

Regards, Nick

Answer 1

Try the adjacent sibling (+) CSS selector, which selects an element that immediately follows another element.

To use a CSS selector, call the .select() method instead of find_all().
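
As a small illustration of how the + combinator differs from a plain tag search, here is a minimal sketch. The markup and the variable names (`adjacent`, `everything`) are made up for this example and are not taken from the question; it also assumes the built-in `html.parser` is good enough for the input.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustrating the "+" combinator.
html = "<dl><dt>A</dt><dd>first</dd><dd>second</dd><dt>B</dt><dd>third</dd></dl>"
soup = BeautifulSoup(html, "html.parser")

# "dt + dd" keeps only a <dd> whose immediately preceding sibling is a <dt>.
adjacent = [dd.get_text() for dd in soup.select("dt + dd")]

# find_all("dd") returns every <dd>, whatever precedes it.
everything = [dd.get_text() for dd in soup.find_all("dd")]

print(adjacent)
print(everything)
```

The second <dd> follows another <dd> rather than a <dt>, so the selector skips it while find_all() still returns it.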

In your example:

for tag in soup.select(".abc +dd dt +dd p"):
    print(tag.contents[0])

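.abc is the class name, so replace abc with the actual class.
Because each <p> tag contains several child nodes (the text, the <br /> tags, and the links), use .contents[0] to get the first one, which is the text you need.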

Output:

THISISWHATINEED
THISISWHATINEED2
THISISWHATINEED3
THISISWHATINEED4
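
If the values should stay grouped under their heading, as the question asks, one possible sketch is to loop over each <dt class="abc"> and then over the <p> tags inside the <dd> that follows it. This is not part of the original answer. It assumes the snippet is wrapped in properly closed <dl> and <dd> tags (the closing tags below were added for the sketch), shortens the links, and uses the built-in `html.parser`; `grouped` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# The question's snippet, with closing tags added and links shortened (assumption).
html = """
<dl>
<dt class="abc">Test</dt><dd><dl>
    <dt>Part1</dt><dd><p>THISISWHATINEED<br /><a href="anyurl">12334</a></p></dd>
    <dt>Part2</dt><dd><p>THISISWHATINEED2<br /><a href="anyurl">12334</a></p></dd>
</dl></dd>
<dt class="abc">Test2</dt><dd><dl>
    <dt>Part3</dt><dd><p>THISISWHATINEED3<br /><a href="anyurl">12334</a></p></dd>
    <dt>Part4</dt><dd><p>THISISWHATINEED4<br /><a href="anyurl">12334</a></p></dd>
</dl></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

grouped = {}
for heading in soup.select("dt.abc"):
    # The <dd> right after the heading holds the nested list.
    body = heading.find_next_sibling("dd")
    # contents[0] is the text node that precedes the first <br />.
    grouped[heading.get_text()] = [str(p.contents[0]) for p in body.select("p")]

for name, values in grouped.items():
    print(name, values)
```

Keeping the result in a dictionary keyed by the heading text makes it easy to look up the values for "Test" or "Test2" separately.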

python

beautifulsoup

web-crawler

html-parsing