How to iterate through all tags of a website in Python with Beautifulsoup?

I'm new to this area. This is the site I need to scrape: http://py4e-data.dr-chuck.net/comments_1430669.html, and this is its source: view-source:http://py4e-data.dr-chuck.net/comments_1430669.html. It is a simple practice site. The HTML looks like:

<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>

<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>

I need to get the numbers inside the comments span tags (100, 100, 99). Here is my code:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')

tag = soup.span          # only returns the first <span> on the page

print(tag)               # <span class="comments">100</span>
print(tag.string)        # 100

I get the number 100, but only the first one. Now I want to get all of the numbers by iterating over a list or something similar. What is the way to do this with BeautifulSoup?

Try the following:

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
data = []

for tr in soup.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    data.append(row[1])     # or data.append(row) for both
    
print(data)

This gives you data, a list holding just the one column:

['Comments', '100', '100', '99', '96', '93', '93', '89', '88', '85', '84', '84', '81', '79', '76', '74', '73', '71', '70', '67', '61', '60', '60', '59', '54', '53', '53', '52', '50', '46', '46', '45', '41', '38', '37', '37', '36', '34', '26', '24', '24', '23', '23', '21', '17', '17', '16', '14', '12', '11', '7']

This first finds all of the table's <tr> rows, then extracts every <td> value from each row. Since you only want the second column, row[1] is appended to the data list holding your values.

You can skip the first entry with data[1:] if needed.
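For example, a minimal sketch (assuming the data list built by the loop above, which still starts with the 'Comments' header) that drops the header and converts the remaining strings to integers:

numbers = [int(value) for value in data[1:]]   # drop the 'Comments' header, cast to int
print(numbers)                                 # [100, 100, 99, ...]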

This approach also lets you keep the names by appending the whole row, e.g. by using data.append(row) instead...

You could then display the entries with:

for name, comment in data[1:]:
    print(name, comment)

giving output beginning with:

Melodie 100
Machaela 100
Rhoan 99
Murrough 96
Lilygrace 93
Ellenor 93
Verity 89
Karlie 88
You can use the find_all() function and then iterate over the result to get the numbers:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all("span")   # every <span> on the page
for tag in tags:
    print(tag.string)
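If you only want the spans with the comments class rather than every <span> on the page, find_all() also accepts a class_ keyword; a minimal sketch against the same page:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')

# restrict the search to <span class="comments"> elements and cast the text to int
numbers = [int(span.string) for span in soup.find_all('span', class_='comments')]
print(numbers)   # [100, 100, 99, ...]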

If you want the names as well, you can also use a Python dictionary:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all("span")
comments = {}
for tag in tags:
    # .text of the enclosing <tr> concatenates both cells, e.g. 'Melodie100'
    commentorName = tag.find_previous('tr').text
    commentorComments = tag.string
    comments[commentorName] = commentorComments
print(comments)

This gives output like:

{'Melodie100': '100', 'Machaela100': '100', 'Rhoan99': '99', 'Murrough96': '96', 'Lilygrace93': '93', 'Ellenor93': '93', 'Verity89': '89', 'Karlie88': '88', 'Berlin85': '85', 'Skylar84': '84', 'Benny84': '84', 'Crispin81': '81', 'Asya79': '79', 'Kadi76': '76', 'Dua74': '74', 'Stephany73': '73', 'Eila71': '71', 'Jennah70': '70', 'Eduardo67': '67', 'Shannan61': '61', 'Chymari60': '60', 'Inez60': '60', 'Charlene59': '59', 'Rosalin54': '54', 'James53': '53', 'Rhy53': '53', 'Zein52': '52', 'Ayren50': '50', 'Marissa46': '46', 'Mcbride46': '46', 'Ruben45': '45', 'Mikee41': '41', 'Carmel38': '38', 'Idahosa37': '37', 'Brooklin37': '37', 'Betsy36': '36', 'Kayah34': '34', 'Szymon26': '26', 'Tea24': '24', 'Queenie24': '24', 'Nima23': '23', 'Eassan23': '23', 'Haleema21': '21', 'Rahma17': '17', 'Rob17': '17', 'Roma16': '16', 'Jeffrey14': '14', 'Yorgos12': '12', 'Denon11': '11', 'Jasmina7': '7'}
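Note that the keys include the count (e.g. 'Melodie100') because find_previous('tr').text joins the text of both cells in the row. If you would rather have clean names as keys, one option is to walk the rows and read each cell separately; a sketch under the same assumptions:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')

comments = {}
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    # skip the header row, whose second cell has no <span class="comments">
    if len(cells) == 2 and cells[1].find('span'):
        comments[cells[0].text] = cells[1].text

print(comments)   # {'Melodie': '100', 'Machaela': '100', 'Rhoan': '99', ...}

If the same name could appear twice, a list of (name, count) pairs would be safer than a dictionary.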