Finding descendants of div with specific class

Question

I am trying to scrape a website and want to get all descendants of divs with a specific class. For example, suppose I have a site that looks like this:

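[Edit: The author of the question indicated in a comment that all div elements should be at the same level; I have therefore taken the liberty of closing them in this sample code.]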

<div class = "blah">
    <p></p>
</div>

<div class = "i-want-this">
    <p></p>
    <p><a href= "http://www.google.com"></a></p>
</div>

<div class = "i-want-this">
    <p></p>
    <li></li>
        <p>meh</p>
    <li></li>
</div>

I want all descendants of every instance of the div class "i-want-this" and want to ignore the other divs. I can select those divs with find_all:

div = soup.find_all('div', {'class': 'i-want-this'})

But this just creates a list of all of them. I have also seen that you can grab descendants with

soup.div.descendants

but I don't know how to restrict that to div tags with a particular class. Any help would be appreciated!

Answer 1

This may be what you want:

div = soup.find_all('div', {'class': 'i-want-this'})

for e in div:
    # .descendants is a generator, so wrap it in list() to see its contents
    print(list(e.descendants))  # or append to a list, or whatever you're trying to do
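
For reference, here is a minimal self-contained sketch of the same idea run against the HTML from the question. It assumes the built-in "html.parser" backend and keeps only Tag nodes, so the whitespace between elements is skipped; the name descendants_per_div is just an illustrative choice.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """
<div class="blah">
    <p></p>
</div>
<div class="i-want-this">
    <p></p>
    <p><a href="http://www.google.com"></a></p>
</div>
<div class="i-want-this">
    <p></p>
    <li></li>
    <p>meh</p>
    <li></li>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# One list of descendant tags per matching div. Text nodes (including the
# whitespace between elements) are dropped by keeping Tag instances only.
descendants_per_div = [
    [node for node in div.descendants if isinstance(node, Tag)]
    for div in soup.find_all("div", class_="i-want-this")
]

for tags in descendants_per_div:
    print([tag.name for tag in tags])
```

Drop the isinstance filter if you also want the text nodes, such as the string "meh".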

Answer 2

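Eventually I came up with this solution. The "children" object captures all children, and subsequent children, of the 'div'; I then iterate over it to collect the text: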

# find_all is the current name for the legacy findChildren alias
children = soup.find_all('div', {'class': 'i-want-this'})

content = []
for item in children:
    # Split the div's text into lines and drop the empty ones
    lines = [line for line in item.text.split('\n') if len(line) > 0]

    # Join the remaining lines into a single string per div
    content.append(' '.join(lines))
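
If only the text is needed, a shorter variant might use get_text with a separator and strip=True, which gives roughly the same result as the split-and-join loop. This is a sketch, not the answer's own code; the sample text inside the tags is made up for illustration, since the HTML in the question is mostly empty.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with some text so there is something to extract
HTML = """
<div class="i-want-this">
    <p>first</p>
    <p><a href="http://www.google.com">link text</a></p>
</div>
<div class="i-want-this">
    <p>meh</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# get_text(separator, strip=True) strips each text fragment, skips the
# empty ones, and joins the rest with the separator.
content = [
    div.get_text(" ", strip=True)
    for div in soup.find_all("div", class_="i-want-this")
]
print(content)
```

Unlike the loop above, this also trims leading and trailing whitespace from each fragment.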

Answer 3

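I put your HTML into a file named temp.htm.

You only need a small part of scrapy for a task like this: Selector. Just feed the HTML into it, then use its xpath method.

In this case I can pick either of the two div elements with the class of interest, by number, then ask for the elements inside it, and then extract their contents. The result in each case is a list of the div's child elements. Note that /* matches direct children only; //* would match descendants at every depth.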

>>> from scrapy.selector import Selector
>>> HTML = open('temp.htm').read()
>>> selector = Selector(text=HTML)
>>> selector.xpath('.//div[@class="i-want-this"][1]/*').extract()
['<p></p>', '<p><a href="http://www.google.com"></a></p>']
>>> selector.xpath('.//div[@class="i-want-this"][2]/*').extract()
['<p></p>', '<li>', '<p>meh</p>', '<li>']

The xpath cheatsheet is very useful at times like this.
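
To make the difference between the child step (/*) and the descendant step (//*) concrete, here is a small sketch. It uses lxml directly rather than scrapy's Selector, purely to keep the example short; the XPath expressions themselves are standard XPath 1.0. The trimmed HTML sample and the variable names are illustrative assumptions.

```python
import lxml.html

HTML = """
<div class="blah">
    <p></p>
</div>
<div class="i-want-this">
    <p></p>
    <p><a href="http://www.google.com"></a></p>
</div>
"""

root = lxml.html.fromstring(HTML)

# Take the first div carrying the class of interest
div = root.xpath('//div[@class="i-want-this"]')[0]

children = [el.tag for el in div.xpath("./*")]      # direct child elements only
descendants = [el.tag for el in div.xpath(".//*")]  # elements at any depth

print(children)
print(descendants)
```

The nested a element shows up only in the second list, which is why //* is the step to use when all descendants are wanted.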

python

beautifulsoup

python-3.x

bs4