All <p class="blah"> 'under' each H2
First off: I know the <p>s aren't really 'under' the <h2>s; they are siblings here. I just needed a way to express the idea in the title.
My sample HTML looks like this:
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
I've been trying to use BeautifulSoup to build a list containing only the 'key' information (yes, as Markdown, but that isn't relevant here), for example:
# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!
# Snakes
## Eastern Brown
A very aggressive and venomous snake
How can I get every <p class="wildlife"> (such as Grey Kangaroo) together with its next sibling, for each h2? I tried this:
for h2 in soup.find_all('h2'):
    print("#### ", h2.text)
    x = h2.find_next_siblings('p', class_='wildlife')
    for item in x:
        print("*", item.text, "*", sep="")
        print(item.find_next_sibling('p').text)
        print(" ")
    print("---")
But the first H2 goes too 'deep' (it also picks up the data that belongs to the second one), and then the second H2 follows:
#### Animals
*Grey Kangaroo*
A bit about kangaroos
*Koala*
These are NOT bears!
*Eastern Brown*
A very aggressive and venomous snake
---
#### Snakes
*Eastern Brown*
A very aggressive and venomous snake
---
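Can this be done? Thanks.
I like dicts for storing structured information that can be reused in later processing.
So I select every element with the class wildlife that follows an <h2>, iterate over the matches, use find_previous('h2') to look up the heading and find_next('p') to look up the note, and store the information in data: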
data = {}
for w in soup.select('h2~.wildlife'):
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
Now you can process the data however you like:
for x in data:
    print('# ' + x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')
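If the Markdown is needed as a string rather than printed output, the same loop can be wrapped in a function. This is only a minimal sketch: the function name to_markdown is hypothetical, and the data dict is written out by hand here to mirror the structure built above, so the block runs without BeautifulSoup.

```python
def to_markdown(data):
    """Render {heading: [{'animal': ..., 'note': ...}, ...]} as Markdown text."""
    lines = []
    for heading, entries in data.items():
        lines.append('# ' + heading)
        for entry in entries:
            lines.append('## ' + entry['animal'])
            lines.append(entry['note'])
    return '\n'.join(lines)


# Hand-written stand-in for the dict produced by the loop above (assumption).
data = {
    'Animals': [
        {'animal': 'Grey Kangaroo', 'note': 'A bit about kangaroos'},
        {'animal': 'Koala', 'note': 'These are NOT bears!'},
    ],
    'Snakes': [
        {'animal': 'Eastern Brown', 'note': 'A very aggressive and venomous snake'},
    ],
}

markdown = to_markdown(data)
print(markdown)
```

Returning a string keeps the extraction step and the output step separate, so the same data can be written to a file or passed on to something else.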
Example
from bs4 import BeautifulSoup

html='''
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
'''

soup = BeautifulSoup(html, 'lxml')

# Group every .wildlife element under the text of the nearest preceding <h2>.
data = {}
for w in soup.select('h2~.wildlife'):
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })

# Print each heading followed by its animals and their notes.
for x in data:
    print('# ' + x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')
Output
# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!
------------------
# Snakes
## Eastern Brown
A very aggressive and venomous snake
------------------
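For comparison, the loop from the question can also be kept close to its original shape. It goes too 'deep' because find_next_siblings returns every later sibling, including those after the next <h2>. The sketch below is one possible fix, not the approach used in this answer: it walks the siblings of each <h2> and stops at the next <h2>. The function name wildlife_by_heading is hypothetical, and 'html.parser' is used here only so the block does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = '''
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
'''


def wildlife_by_heading(soup):
    """Map each <h2> text to the wildlife entries found before the next <h2>."""
    sections = {}
    for h2 in soup.find_all('h2'):
        entries = []
        # True matches every tag, so text nodes between elements are skipped.
        for sib in h2.find_next_siblings(True):
            if sib.name == 'h2':
                break  # the next section starts here
            if sib.name == 'p' and 'wildlife' in sib.get('class', []):
                note = sib.find_next_sibling('p')
                entries.append({
                    'animal': sib.get_text(strip=True),
                    # Assumes the note is the <p> right after the entry.
                    'note': note.get_text(strip=True) if note else '',
                })
        sections[h2.get_text(strip=True)] = entries
    return sections


soup = BeautifulSoup(html, 'html.parser')
sections = wildlife_by_heading(soup)
print(sections)
```

Unlike the selector-based version, this keeps headings that have no wildlife entries (with an empty list), which may or may not be what you want.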
EDIT
If you just want to print directly, you could go with:
data = []
for w in soup.select('.wildlife'):
    h2 = w.find_previous('h2').text
    # Print the heading only the first time it is seen.
    if h2 not in data:
        data.append(h2)
        print('------------------')
        print('# ' + h2)
    print('## ' + w.text)
    print(w.find_next('p').text)
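The list of already-seen headings above can also be replaced with itertools.groupby, since the matches come back in document order and entries under the same heading are therefore adjacent. This is a hedged sketch of that variant: the helper name heading_of is hypothetical, it assumes every .wildlife element has an <h2> somewhere before it, and 'html.parser' is used only to avoid the lxml dependency.

```python
from itertools import groupby

from bs4 import BeautifulSoup

html = '''
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
'''


def heading_of(tag):
    """Return the text of the nearest <h2> that precedes the tag."""
    return tag.find_previous('h2').get_text(strip=True)


soup = BeautifulSoup(html, 'html.parser')

lines = []
# groupby only merges consecutive items, which is enough for document order.
for heading, items in groupby(soup.select('h2 ~ .wildlife'), key=heading_of):
    lines.append('# ' + heading)
    for w in items:
        lines.append('## ' + w.get_text(strip=True))
        lines.append(w.find_next('p').get_text(strip=True))

print('\n'.join(lines))
```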