All <p class="blah"> 'under' each H2

First off: I know the <p>s aren't really 'under' the <h2>s; they are siblings here. I just needed a way to express the idea in the title.

My sample HTML looks like this:

<h1>Wildlife near me</h1>

<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>

<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>

I've been trying to use BeautifulSoup to build a list that contains only the 'key' information (yes, it's Markdown, but that's not relevant here), for example:

# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!

# Snakes
## Eastern Brown
A very aggressive and venomous snake

How do I get all the <p class="wildlife">Grey Kangaroo</p> elements and their next sibling... for each h2? I tried this:

for h2 in soup.find_all('h2'):
    print("#### ",h2.text)
    x = h2.find_next_siblings('p', class_='wildlife')
    for item in x:
        print("*",item.text,"*",sep="")
        print(item.find_next_sibling('p').text)
        print("    ")
    print("---")

But the first H2 goes too 'deep' (it also pulls in the second H2's data), and then the second H2 is processed:

####  Animals
*Grey Kangaroo*
A bit about kangaroos
    
*Koala*
These are NOT bears!
    
*Eastern Brown*
A very aggressive and venomous snake
    
---
####  Snakes
*Eastern Brown*
A very aggressive and venomous snake
    
---

Can this be done? Thanks.

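I like dicts for storing structured information that can be reused in later processing.

So I select all the <p> elements with the class .wildlife, iterate over them, and use find_previous('h2') and find_next('p') to store the information in data: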

data = {}

for w in soup.select('h2~.wildlife'):
    
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
        
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
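As a rough illustration of what the selector is doing: `h2~.wildlife` uses the CSS general sibling combinator, so it matches every `.wildlife` element that comes after an `<h2>` at the same level, and `find_previous('h2')` then recovers the nearest heading before each match. The sketch below uses a shortened copy of the sample HTML and `html.parser` (my assumption, chosen only so nothing beyond bs4 needs to be installed).

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question.
html = """
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<h2>Snakes</h2>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
"""

# Assumption: html.parser (bundled with Python) instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# Each .wildlife element is matched once, in document order.
hits = soup.select("h2 ~ .wildlife")
matched = [p.get_text() for p in hits]

# The closest <h2> that appears before each match in the document.
headings = [p.find_previous("h2").get_text() for p in hits]

print(matched)
print(headings)
```

Note that the selector alone does not group anything by heading; the grouping comes from the `find_previous('h2')` lookup.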

Now you can process the data however you like:

for x in data:
    print('# '+ x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')
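If the goal is the Markdown text itself rather than printed lines, the same dict can be rendered into a string. This is only a sketch: `to_markdown` is a hypothetical helper name, and the `data` literal simply mirrors the structure built above.

```python
def to_markdown(data):
    """Render {heading: [{'animal': ..., 'note': ...}, ...]} as Markdown text."""
    lines = []
    for heading, entries in data.items():
        lines.append("# " + heading)
        for entry in entries:
            lines.append("## " + entry["animal"])
            lines.append(entry["note"])
        lines.append("")  # blank line between sections
    # Drop the trailing blank line and end with a single newline.
    return "\n".join(lines).rstrip() + "\n"


# Hand-written stand-in for the dict produced by the loop above.
data = {
    "Animals": [
        {"animal": "Grey Kangaroo", "note": "A bit about kangaroos"},
        {"animal": "Koala", "note": "These are NOT bears!"},
    ],
    "Snakes": [
        {"animal": "Eastern Brown", "note": "A very aggressive and venomous snake"},
    ],
}

markdown = to_markdown(data)
print(markdown)
```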

Example

from bs4 import BeautifulSoup

html='''
<h1>Wildlife near me</h1>

<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>

<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
'''

soup = BeautifulSoup(html, 'lxml')

data = {}

for w in soup.select('h2~.wildlife'):
    
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
        
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
    

for x in data:
    print('# '+ x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')

Output

# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!
------------------
# Snakes
## Eastern Brown
A very aggressive and venomous snake
------------------

EDIT

If you just want to print directly, you could go with:

data = []

for w in soup.select('.wildlife'):
    
    h2 = w.find_previous('h2').text
    
    if h2 not in data:
        data.append(h2)
        print('------------------')
        print('# ' + h2)
        
    print('## ' + w.text)
    print(w.find_next('p').text)
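For completeness, here is a sketch that stays closer to the loop in the question. The original attempt goes too 'deep' because `find_next_siblings` returns every later sibling, including those after the next `<h2>`. One way around that, assuming the flat sibling layout of the sample HTML, is to walk the siblings and stop at the next heading. `collect_sections` is a hypothetical helper name, and `html.parser` is again my choice of parser.

```python
from bs4 import BeautifulSoup, Tag

html = """
<h1>Wildlife near me</h1>

<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>

<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
"""


def collect_sections(soup):
    """Group each .wildlife <p> and the <p> after it under its own <h2>."""
    sections = {}
    for h2 in soup.find_all("h2"):
        entries = []
        for node in h2.next_siblings:
            if not isinstance(node, Tag):
                continue  # skip whitespace strings between tags
            if node.name == "h2":
                break  # the next section starts here
            if node.name != "p":
                continue
            if "wildlife" in node.get("class", []):
                entries.append({"animal": node.get_text(), "note": None})
            elif entries and entries[-1]["note"] is None:
                # First plain <p> after a wildlife entry becomes its note.
                entries[-1]["note"] = node.get_text()
        sections[h2.get_text()] = entries
    return sections


soup = BeautifulSoup(html, "html.parser")
sections = collect_sections(soup)

for heading, entries in sections.items():
    print("# " + heading)
    for entry in entries:
        print("## " + entry["animal"])
        print(entry["note"])
```

Unlike `find_next('p')`, this never reaches into the following section, and a wildlife entry with no plain `<p>` after it keeps `None` as its note (which the print loop above would show as `None`).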