Getting href URLs using BeautifulSoup in Python
I am trying to download all CSV files from the following URL: https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices, but unfortunately I could not get it to work as expected. Here is my attempt:
soup = BeautifulSoup(page.content, "html.parser")
market_dataset = soup.findAll("table",{"class":"table table-striped table-condensed table-clean"})
for a in market_dataset.find_all('a', href=True):
    print("Found the URL:", a['href'])
Can anyone help me? How can I get the URLs of all the CSV files?
Select your elements more specifically, e.g. with CSS selectors, and be aware that you have to concatenate the href with the base URL:
['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]
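Plain concatenation works here because every href on this page starts with a slash. As a slightly more defensive sketch, urljoin from the standard library resolves an href against the page URL and leaves links that are already absolute untouched. The names PAGE_URL and absolute_url are illustrative and not part of the original answer.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
PAGE_URL = "https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices"


def absolute_url(href, base=PAGE_URL):
    """Resolve an href against the page URL (hypothetical helper)."""
    return urljoin(base, href)


# A root-relative href, as found in the td.csv cells.
print(absolute_url(
    "/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv"
))
```

In the list comprehension this would replace the string concatenation, i.e. absolute_url(a['href']) instead of 'https://emi.ea.govt.nz'+a['href'].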
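Or simply change your code and use find() instead of findAll() to locate the table. findAll() returns a ResultSet (a list of elements), and calling find_all() on that list is what causes the following AttributeError: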
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
market_dataset = soup.find("table",{"class":"table table-striped table-condensed table-clean"})
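With find() returning a single Tag, the loop from the question runs as intended. The following is a minimal offline sketch of that fix. SAMPLE_HTML is made up for illustration and only assumes the table class and td.csv cells mentioned in this thread; table_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the assumed structure of the real page.
SAMPLE_HTML = """
<table class="table table-striped table-condensed table-clean">
  <tr><td class="csv"><a href="/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv">csv</a></td></tr>
  <tr><td class="csv"><a href="/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv">csv</a></td></tr>
</table>
"""


def table_links(html):
    """Return the href of every link inside the first matching table."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns one Tag (or None), so find_all() can be called on it.
    table = soup.find(
        "table", {"class": "table table-striped table-condensed table-clean"}
    )
    if table is None:
        return []
    return [a["href"] for a in table.find_all("a", href=True)]


for href in table_links(SAMPLE_HTML):
    print("Found the URL:", href)
```

The None check is an addition for safety: find() returns None when no table matches, which would otherwise raise a different AttributeError.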
Note: In new code, use find_all() consistently instead of the old syntax findAll() or a mix of both.
Example
from bs4 import BeautifulSoup
import requests
url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
urls = ['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]
print(urls)
Output
['https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220316_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220315_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220314_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220313_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220312_FinalEnergyPrices.csv',...]
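Since the question asks about downloading the files and not only listing them, here is a rough sketch of that last step. It assumes a list of absolute CSV URLs like the one shown in the output. The function names, the csv_files directory, and the 30-second timeout are illustrative choices, not something given in the thread.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

import requests


def filename_from_url(url):
    """Take the last path segment of the URL as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download_csvs(urls, dest_dir="csv_files", timeout=30):
    """Download each URL into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    # One session reuses the connection across requests to the same host.
    with requests.Session() as session:
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # stop on HTTP errors
            target = dest / filename_from_url(url)
            target.write_bytes(response.content)
            saved.append(target)
    return saved
```

Calling download_csvs with the list of URLs would write one file per link, for example csv_files/20220318_FinalEnergyPrices_I.csv. For a long list it may be worth adding a short pause between requests to avoid loading the server.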