Beautiful Soup extracts only one tag even though all the others are visible in the HTML
I am trying to understand how web scraping works:
import requests
from bs4 import BeautifulSoup as soup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")

items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
print(price.string)
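However, when I run this, all it returns is the last price on the page ($1799.00). Why does it skip all the other h4 tags and return only the last one?
Any help would be much appreciated!
If you need any more information, please let me know.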
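What happens?
You call print() after the loop has finished iterating, so price only holds the value from the final iteration. That is why you get only the last result.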
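The behaviour can be reproduced without any network access. The following is a minimal sketch, assuming a hypothetical list of price strings in place of the scraped h4 tags: a name assigned inside a for loop keeps its last value once the loop ends, so anything that reads it afterwards sees only the final element.

```python
# Hypothetical stand-in values; the real script reads these from the page.
prices = ["$100.00", "$250.00", "$1799.00"]

seen_inside = []
for price in prices:
    # Code indented under the loop runs once per element.
    seen_inside.append(price)

# Code after the loop runs once; `price` still refers to the final element.
seen_after = price

print(seen_inside)
print(seen_after)
```

Here seen_inside collects every value, while seen_after holds only the last one, which mirrors the difference between calling print() inside and after the loop.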
How to fix?
Move the print() call inside the loop by indenting it:
for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
    # Indented, so this runs once for every item
    print(price.string)
Output
5.99
9.00
9.00
6.99
1.94
6.49
4.46
2.70
9.94
9.95
1.48
3.88
9.00
9.99
4.23
8.98
9.63
0.46
0.66
6.99
3.30
6.29
6.29
9.73
4.62
4.73
7.38
5.95
8.56
9.10
4.23
5.90
7.80
8.64
8.78
4.71
7.17
8.23
0.99
4.98
7.99
1.99
9.99
9.00
9.00
9.00
9.99
5.99
9.00
9.00
9.00
9.00
33.99
96.02
98.42
99.00
99.00
01.83
02.66
10.14
12.91
14.55
23.87
23.87
24.20
33.82
33.91
39.54
40.62
43.40
44.20
44.40
49.00
49.00
49.73
54.04
70.10
78.19
78.99
79.00
87.88
87.98
99.00
99.00
99.73
03.41
12.16
21.58
23.99
35.49
38.37
39.20
44.99
59.00
60.13
71.06
73.11
81.99
94.74
99.00
10.39
11.99
26.83
33.00
37.28
38.37
41.22
47.78
49.23
62.24
66.32
81.13
99.00
99.00
69.00
69.00
99.00
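Example
Instead of printing the results while iterating, store them in a structured form as a list of dicts, then print or save that list after the for loop: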
import requests
from bs4 import BeautifulSoup as soup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")

items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})

data = []
for item in items:
    # One dict per product: title from the first link, price from the h4 tag
    data.append({
        'caption' : item.a['title'],
        'price' : item.find('h4', {'class': 'pull-right price'}).string
    })

# Runs once, after every item has been collected
print(data)
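For the "save it" part, one possible approach is to write the list of dicts to a CSV file with the standard library's csv.DictWriter. This is only a sketch: the helper name save_rows, the file name laptops.csv, and the sample rows are assumptions made for illustration, not part of the original answer.

```python
import csv

def save_rows(rows, path):
    """Write a list of dicts with 'caption' and 'price' keys to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["caption", "price"])
        writer.writeheader()      # first row: column names
        writer.writerows(rows)    # one row per dict

# Hypothetical sample rows shaped like the `data` list built above.
sample = [
    {"caption": "Example Laptop A", "price": "$100.00"},
    {"caption": "Example Laptop B", "price": "$1799.00"},
]

if __name__ == "__main__":
    save_rows(sample, "laptops.csv")
```

In the scraping script, calling save_rows(data, "laptops.csv") after the loop would write every collected item in one pass.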