Python - How to extract elements between multiple tags
Working HTML:
<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a>
<h3> Subheading 1.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h3> Subheading 2.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 2.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>
Question:
I want to extract the h3 tags between each pair of h2 tags, and all the anchors between each pair of h3 tags.
What I have:
from bs4 import BeautifulSoup

soup = BeautifulSoup("""<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a>
<h3> Subheading 1.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h3> Subheading 2.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 2.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>""", 'html5lib')
for row in soup.find_all("h2"):
    print(row.text)
    # find_next returns only the first h3 after this h2
    print(row.find_next('h3'))
    print('################')
Current result:
################
Heading 1
<h3> Subheading 1.1 </h3>
################
Heading 2
<h3> Subheading 2.1 </h3>
################
Heading 3
None
################
Desired result:
################
Heading 1
Subheading 1.1
Link 1
Link 2
Link 3
--------
Subheading 1.2
Link 1
Link 2
Link 3
Link 4
--------
Subheading 1.3
Link 1
################
Heading 2
Subheading 2.1
Link 1
Link 2
--------
Subheading 2.2
Link 1
Link 2
--------
Subheading 2.3
Link 1
################
or something similar.
This works!
s = """
<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a>
<h3> Subheading 1.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h3> Subheading 2.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 2.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>
"""
from bs4 import BeautifulSoup as bs

# Name the parser explicitly so bs4 does not warn about guessing one
soup = bs(s, 'html.parser')
for i in soup.find_all('h2'):
    print(i.text.strip())
    # Walk the siblings after this h2 until the next h2 starts
    for j in i.next_siblings:
        if j.name == 'h2':
            break
        if j.name == 'h3':
            print('\t' + j.text.strip())
            # Walk the siblings after this h3 until the next heading starts
            for k in j.next_siblings:
                if k.name in ('h2', 'h3'):
                    break
                if k.name == 'a':
                    print('\t\t' + k.text.strip())
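If you need the data as a structure rather than just printed, here is a rough single-pass variant of the same idea. It is only a sketch: the function name group_links and the shortened HTML sample are made up for illustration, and it assumes the h2, h3 and a tags appear in document order as in the question, so each link is filed under the most recent h3 and each h3 under the most recent h2.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the question's HTML (illustrative only)
HTML = """
<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h2> Heading 3 </h2>
"""


def group_links(html):
    """Return {h2 text: {h3 text: [link texts]}} in document order."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    heading = subheading = None
    # find_all with a list of names yields matching tags in document order
    for tag in soup.find_all(["h2", "h3", "a"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            # A new h2 starts a new section and resets the current h3
            heading, subheading = text, None
            sections[heading] = {}
        elif tag.name == "h3" and heading is not None:
            subheading = text
            sections[heading][subheading] = []
        elif tag.name == "a" and subheading is not None:
            sections[heading][subheading].append(text)
    return sections


# Print the nested structure, one indentation level per tag type
for heading, subheadings in group_links(HTML).items():
    print(heading)
    for subheading, links in subheadings.items():
        print("\t" + subheading)
        for link in links:
            print("\t\t" + link)
```

Unlike the next_siblings loops, this does not depend on the tags being direct siblings, so it should also cope with links wrapped in a p or div. The trade-off is that links appearing before the first h3 of a section are silently dropped.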