如何从我的网络抓取工具中删除 <a href... 标签
How do I remove the <a href... tags from my web scrapper
通过@F.Hoque 提供的 pandas.read_html()
阅读表格可能会更精简,但您也可以仅使用 BeautifulSoup
获得结果。
遍历 <table>
的所有 <tr>
,通过 .text
/ .get_text()
从 tags
中选择信息并将其结构化存储在字典列表中:
data = []
for row in soup.select('table.table tr')[1:]:
data.append({
'rank': row.td.text,
'title': row.a.text.split(' (')[0].strip(),
'releaseYear': row.a.text.split(' (')[1][:-1]
})
例子
import requests
from bs4 import BeautifulSoup
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = requests.get(url=url)
soup = BeautifulSoup(result.text, 'html.parser')
data = []
for row in soup.select('table.table tr')[1:]:
data.append({
'rank': row.td.text,
'title': row.a.text.split(' (')[0].strip(),
'releaseYear': row.a.text.split(' (')[1][:-1]
})
data
输出
[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
{'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
{'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
{'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
{'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]
通过@F.Hoque 提供的 pandas.read_html()
阅读表格可能会更精简,但您也可以仅使用 BeautifulSoup
获得结果。
遍历 <table>
的所有 <tr>
,通过 .text
/ .get_text()
从 tags
中选择信息并将其结构化存储在字典列表中:
data = []
for row in soup.select('table.table tr')[1:]:
data.append({
'rank': row.td.text,
'title': row.a.text.split(' (')[0].strip(),
'releaseYear': row.a.text.split(' (')[1][:-1]
})
例子
import requests
from bs4 import BeautifulSoup
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = requests.get(url=url)
soup = BeautifulSoup(result.text, 'html.parser')
data = []
for row in soup.select('table.table tr')[1:]:
data.append({
'rank': row.td.text,
'title': row.a.text.split(' (')[0].strip(),
'releaseYear': row.a.text.split(' (')[1][:-1]
})
data
输出
[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
{'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
{'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
{'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
{'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]