How to parse information from the same class using BeautifulSoup?
Suppose I have the following HTML:
html_doc = """
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class = "Box1">
<span class = "catagory">Plant</span>
<div class = "Box2">
<span class = "sub-catagory">Trees</span>
<div class = "characters">
<div class = "font-medium">1.2</div>
<div class = "font-medium">1.6</div>
<div class = "font-medium">1.7</div>
<div class = "font-medium">1.8</div>
<div class = "font-medium">1.9</div>
<div class = "font-medium">1.4</div>
</div>
<span class = "sub-catagory">Flowers</span>
<div class = "characters">
<div class = "font-medium">2.2</div>
<div class = "font-medium">3.6</div>
<div class = "font-medium">4.7</div>
<div class = "font-medium">5.8</div>
<div class = "font-medium">6.9</div>
<div class = "font-medium">7.4</div>
</div>
</div>
<span class = "catagory">animals</span>
<div class = "Box2">
<span class = "sub-catagory">human</span>
<div class = "characters">
<div class = "font-medium">7.2</div>
<div class = "font-medium">9.6</div>
<div class = "font-medium">4.7</div>
<div class = "font-medium">3.8</div>
<div class = "font-medium">6.9</div>
<div class = "font-medium">9.4</div>
</div>
<span class = "sub-catagory">dog</span>
<div class = "characters">
<div class = "font-medium">4.2</div>
<div class = "font-medium">5.6</div>
<div class = "font-medium">6.7</div>
<div class = "font-medium">1.8</div>
<div class = "font-medium">3.9</div>
<div class = "font-medium">8.4</div>
</div>
</div>
<span class = "catagory">non-living</span>
<div class = "Box2">
<span class = "sub-catagory">rock</span>
<div class = "characters">
<div class = "font-medium">1.2</div>
<div class = "font-medium">1.6</div>
<div class = "font-medium">4.7</div>
<div class = "font-medium">6.8</div>
<div class = "font-medium">1.9</div>
<div class = "font-medium">0.4</div>
</div>
<span class = "sub-catagory">stars</span>
<div class = "characters">
<div class = "font-medium">3.2</div>
<div class = "font-medium">5.6</div>
<div class = "font-medium">2.7</div>
<div class = "font-medium">4.8</div>
<div class = "font-medium">1.9</div>
<div class = "font-medium">2.4</div>
</div>
</div>
</div>
</div>
</body>
</html>
"""
Using Python's BeautifulSoup package, I can get the categories, sub-categories, and characters separately, as shown below:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# The class names must match the spelling used in the HTML ('catagory').
categories = soup.find_all('span', class_='catagory')
for category in categories:
    print(category.get_text())  # gives Plant, animals, non-living

sub_categories = soup.find_all('span', class_='sub-catagory')
for sub_category in sub_categories:
    print(sub_category.get_text())  # gives me the sub-categories

measurements = soup.find_all('div', class_='font-medium')
for measurement in measurements:
    print(measurement.get_text())  # gives me all the font-medium values together
I am not sure how to get the following result, since the div classes are all the same. Please help.
Plant
Trees
1.2
1.6
1.7
1.8
1.9
1.4
Flowers
2.2
3.6
4.7
5.8
6.9
7.4
animals
human
7.2
9.6
4.7
3.8
6.9
9.4
dog
4.2
5.6
6.7
1.8
3.9
8.4
non-living
rock
1.2
1.6
4.7
6.8
1.9
0.4
stars
3.2
5.6
2.7
4.8
1.9
2.4
To print the text in the expected layout, select your Box1 and extract its text with get_text(), setting the separator argument to \n and strip=True:
print(soup.select_one('.Box1').get_text('\n',strip=True))
Plant
Trees
1.2
1.6
1.7
1.8
1.9
1.4
Flowers
2.2
3.6
4.7
5.8
6.9
7.4
animals
...
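As a minimal sketch of what the separator and strip arguments do, the following uses a shortened, hypothetical snippet rather than the full html_doc. The variable names (snippet, box, raw_text, joined_text, pieces) are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup used only for illustration.
snippet = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="characters">
    <div class="font-medium">1.2</div>
    <div class="font-medium">1.6</div>
  </div>
</div>
"""

box = BeautifulSoup(snippet, 'html.parser').select_one('.Box1')

# Without arguments, the whitespace between the tags is kept in the result.
raw_text = box.get_text()

# With a separator and strip=True, each text node is stripped, empty ones
# are dropped, and the remaining pieces are joined with the separator.
joined_text = box.get_text('\n', strip=True)
print(joined_text)

# stripped_strings yields the same pieces one by one.
pieces = list(box.stripped_strings)
```

This flattens everything under Box1 into one string, so it suits printing but keeps no record of which value belongs to which sub-category.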
To get more structured output, change the way you select the elements:
for e in soup.select('span.sub-catagory'):
    data.append({
        'cat': e.find_previous('span', {'class': 'catagory'}).text,
        'subcat': e.text,
        'characters': list(e.find_next('div').stripped_strings)
    })
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

data = []
for e in soup.select('span.sub-catagory'):
    data.append({
        # the closest category span that precedes this sub-category
        'cat': e.find_previous('span', {'class': 'catagory'}).text,
        'subcat': e.text,
        # all non-empty text pieces inside the div that follows it
        'characters': list(e.find_next('div').stripped_strings)
    })

data
Output
[{'cat': 'Plant',
'subcat': 'Trees',
'characters': ['1.2', '1.6', '1.7', '1.8', '1.9', '1.4']},
{'cat': 'Plant',
'subcat': 'Flowers',
'characters': ['2.2', '3.6', '4.7', '5.8', '6.9', '7.4']},
{'cat': 'animals',
'subcat': 'human',
'characters': ['7.2', '9.6', '4.7', '3.8', '6.9', '9.4']},
{'cat': 'animals',
'subcat': 'dog',
'characters': ['4.2', '5.6', '6.7', '1.8', '3.9', '8.4']},
{'cat': 'non-living',
'subcat': 'rock',
'characters': ['1.2', '1.6', '4.7', '6.8', '1.9', '0.4']},
{'cat': 'non-living',
'subcat': 'stars',
'characters': ['3.2', '5.6', '2.7', '4.8', '1.9', '2.4']}]
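If the goal is the nested layout from the question (category, then sub-category, then values), one possible sketch is to group the same elements into a dictionary and print it level by level. This is not part of the approach above: it assumes each span.catagory is followed by its div.Box2 sibling and each span.sub-catagory by its div.characters sibling, as in the sample HTML, and it runs on a shortened, hypothetical document. The names sample and grouped are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the HTML from the question.
sample = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="Box2">
    <span class="sub-catagory">Trees</span>
    <div class="characters">
      <div class="font-medium">1.2</div>
      <div class="font-medium">1.6</div>
    </div>
    <span class="sub-catagory">Flowers</span>
    <div class="characters">
      <div class="font-medium">2.2</div>
    </div>
  </div>
  <span class="catagory">animals</span>
  <div class="Box2">
    <span class="sub-catagory">dog</span>
    <div class="characters">
      <div class="font-medium">4.2</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')

grouped = {}
for cat in soup.select('span.catagory'):
    # Assumption: the Box2 div belonging to a category is its next sibling.
    box = cat.find_next_sibling('div', class_='Box2')
    subcats = {}
    for sub in box.select('span.sub-catagory'):
        # Assumption: the values follow the sub-category as a sibling div.
        values = sub.find_next_sibling('div', class_='characters')
        subcats[sub.get_text(strip=True)] = [
            v.get_text(strip=True) for v in values.select('div.font-medium')
        ]
    grouped[cat.get_text(strip=True)] = subcats

# Print category, sub-category, and values in the order they appear.
for cat_name, subcats in grouped.items():
    print(cat_name)
    for sub_name, values in subcats.items():
        print(sub_name)
        print('\n'.join(values))
```

Because the lookup goes through siblings rather than shared class names, the identical font-medium class on every value no longer matters; if the real page nests these elements differently, the two sibling assumptions would need adjusting.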