Getting href with BeautifulSoup
I'm trying to get information from a table. Some of the TDs contain links, and in those cases I want to retrieve the href="" content instead of the TD text itself. This is the code I've been using:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'http://www.milavia.net/airshows/calendar/showdates-2020-world.html'
page = session.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tableOutput = []
for row in soup.find_all('tr')[1:]:
    date, event, location, website, facebook, feature, notes = row.find_all('td')[0:7]
    # print(website)
    p = {
        'Date': date.text.strip(),
        'Event': event.text.strip(),
        'Location': location.text.strip(),
        # 'Site': website.text.strip(),
        'Site': website.select('a', href=True, text='TEXT'),
        'Facebook': facebook.text.strip(),
        'Featuring': feature.text.strip(),
        'Notes': notes.text.strip()
    }
    tableOutput.append(p)
print(tableOutput)
This is the output:
[{'Data': '15-18 Jan', 'Evento': 'Kuwait Aviation Show', 'Local': 'Kuwait International Airport, Kuwait', 'Site': [<a class="asclnk" href="http://kuwaitaviationshow.com/" target="airshow" title="Visit Kuwait Aviation Show Website: kuwaitaviationshow.com">link</a>], 'Facebook': '', 'Atração': '', 'Obs.': 'public 17-18'}, {'Data': '18 Jan', 'Evento': 'Classics of the Sky Tauranga City Airshow', 'Local': 'Tauranga, New Zealand', 'Site': [<a class="asclnk" href="http://www.tcas.nz" target="airshow" title="Visit Classics of the Sky Tauranga City Airshow Website: www.tcas.nz">link</a>], 'Facebook': '', 'Atração': '', 'Obs.': ''}, {'Data': 'Date', 'Evento': 'Event', 'Local': 'Location', 'Site': [], 'Facebook': 'Facebook', 'Atração': 'Feature', 'Obs.': 'Notes'}]
I can't manage to extract just the href value, e.g. from
a class="asclnk" href="http://www.tcas.nz" target="airshow" title="Visit Classics of the Sky Tauranga City Airshow Website: www.tcas.nz">
I've tried a few approaches with website.select() or website.find(), but none of them gave the result I need.
Any help would be great. Thanks.
The reference to the link you tried fails because you are iterating over rows, and some rows have no anchor tag with an href attribute, so it raises an error. I've added an if condition to check for that. Try:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'http://www.milavia.net/airshows/calendar/showdates-2020-world.html'
page = session.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tableOutput = []
for row in soup.find_all('tr')[1:]:
    date, event, location, website, facebook, feature, notes = row.find_all('td')[0:7]
    if website.select_one('a[href]'):
        p = {
            'Date': date.text.strip(),
            'Event': event.text.strip(),
            'Location': location.text.strip(),
            # 'Site': website.text.strip(),
            'Site': website.select_one('a[href]')['href'],
            'Facebook': facebook.text.strip(),
            'Featuring': feature.text.strip(),
            'Notes': notes.text.strip()
        }
        tableOutput.append(p)
print(tableOutput)
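The pattern above can be shown on a minimal, self-contained sketch (the HTML here is a hypothetical stand-in for the live page): one cell has a link, one doesn't, which is exactly why the if check is needed before indexing into the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration, not the real milavia.net page.
html = """
<table>
  <tr><td><a class="asclnk" href="http://example.com/">link</a></td></tr>
  <tr><td>no link here</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

hrefs = []
for td in soup.find_all('td'):
    a = td.select_one('a[href]')   # None when the cell has no anchor with an href
    if a:
        hrefs.append(a['href'])    # the attribute value, not the tag's text

print(hrefs)  # ['http://example.com/']
```

Indexing a Tag like `a['href']` returns the attribute value; `a.text` would return the visible link text ("link") instead, which is why `website.text.strip()` never produced the URL.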
Output:
[{'Featuring': '', 'Location': 'Kuwait International Airport, Kuwait', 'Site': 'http://kuwaitaviationshow.com/', 'Date': '15-18 Jan', 'Facebook': '', 'Event': 'Kuwait Aviation Show', 'Notes': 'public 17-18'}, {'Featuring': '', 'Location': 'Tauranga, New Zealand', 'Site': 'http://www.tcas.nz', 'Date': '18 Jan', 'Facebook': '', 'Event': 'Classics of the Sky Tauranga City Airshow', 'Notes': ''}, {'Featuring': '', 'Location': 'Lucknow, Uttar Pradesh, India', 'Site': 'https://defexpo.gov.in/', 'Date': '05-08 Feb', 'Facebook': '', 'Event': 'Defexpo India 2020', 'Notes': 'public Sat. 8th'}, {'Featuring': '', 'Location': 'Changi Exhibition Centre, Singapore', 'Site': 'http://www.singaporeairshow.com/', 'Date': '11-16 Feb', 'Facebook': '', 'Event': 'Singapore Airshow 2020', 'Notes': 'public Sat-SunReports: 2018 2014'}, {'Featuring': '', 'Location': 'Al Bateen Executive Airport, Abu Dhabi, United Arab Emirates', 'Site': 'http://www.adairexpo.com/', 'Date': '04-06 Mar', 'Facebook': '', 'Event': 'Abu Dhabi Air Expo & Heli Expo 2020', 'Notes': 'trade expo'}, {'Featuring': '', 'Location': "Djerba–Zarzis Int'l Airport, Djerba, Tunisia", 'Site': 'http://www.iadetunisia.com/en/', 'Date': '04-08 Mar', 'Facebook': '', 'Event': 'IADE Tunisia 2020', 'Notes': 'public days 7-8'}, {'Featuring': '', 'Location': 'Tyabb Airport, Tyabb VIC, Australia', 'Site': 'http://www.tyabbairshow.com/', 'Date': '08 Mar', 'Facebook': '', 'Event': 'Tyabb Air Show 2020', 'Notes': ''}, {'Featuring': '', 'Location': 'Echuca Airport, Echuca VIC, Australia', 'Site': 'http://www.antique-aeroplane.com.au/', 'Date': '20-22 Mar', 'Facebook': '', 'Event': 'AAAA National Fly-in', 'Notes': ''}, {'Featuring': '', 'Location': "Santiago Int'l Airport, Santiago, Chile", 'Site': 'http://www.fidae.cl/', 'Date': '31 Mar / 05 Apr', 'Facebook': '', 'Event': 'FIDAE 2020', 'Notes': 'public Apr 4-5'}, {'Featuring': '', 'Location': "Santiago Int'l Airport, Santiago, Chile", 'Site': 'http://www.fidae.cl/', 'Date': '31 Mar / 05 
Apr', 'Facebook': '', 'Event': 'FIDAE 2020', 'Notes': 'public Apr 4-5'}, {'Featuring': '', 'Location': 'Wanaka Airport, Otago, New Zealand', 'Site': 'http://www.warbirdsoverwanaka.com/', 'Date': '11-13 Apr', 'Facebook': '', 'Event': 'Warbirds Over Wanaka 2020', 'Notes': 'Report 2010'}, {'Featuring': '', 'Location': 'Illawarra Regional Airport, Wollongong NSW, Australia', 'Site': 'http://www.woi.org.au/', 'Date': '02-03 May', 'Facebook': '', 'Event': 'Wings over Illawarra', 'Notes': ''}, {'Featuring': '', 'Location': 'AFB Waterkloof, Centurion, South Africa', 'Site': 'http://www.aadexpo.co.za/', 'Date': '16-20 Sep', 'Facebook': '', 'Event': 'Africa Aerospace & Defence - AAD 2020', 'Notes': 'public 19-20'}, {'Featuring': '', 'Location': 'JIExpo Kemayoran, Jakarta, Indonesia', 'Site': 'http://www.indoaerospace.com/', 'Date': '04-07 Nov', 'Facebook': '', 'Event': 'Indo Aerospace 2020', 'Notes': 'trade only'}, {'Featuring': '', 'Location': 'Zhuhai, Guangdong, China', 'Site': 'http://www.airshow.com.cn/', 'Date': '10-15 Nov', 'Facebook': '', 'Event': 'Airshow China 2020', 'Notes': 'public 13-15th'}, {'Featuring': '', 'Location': 'Sakhir Air Base, Bahrain', 'Site': 'http://www.bahraininternationalairshow.com/', 'Date': '18-20 Nov', 'Facebook': '', 'Event': 'Bahrain International Airshow BIAS 2020', 'Notes': ''}]
Filter with CSS up front. With bs4 4.7.1 you can ensure you only work with rows that contain those links by using :has(). This cuts down the lines of code and removes the need for indexing. If you use select(), you can take advantage of the limit parameter.
import requests
from bs4 import BeautifulSoup

url = 'http://www.milavia.net/airshows/calendar/showdates-2020-world.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tableOutput = []
for row in soup.select('tr:has(.asclnk[href])'):
    date, event, location, website, facebook, feature, notes = row.select('td', limit=7)
    p = {
        'Date': date.text.strip(),
        'Event': event.text.strip(),
        'Location': location.text.strip(),
        'Site': website.select_one('a[href]')['href'],
        'Facebook': facebook.text.strip(),
        'Featuring': feature.text.strip(),
        'Notes': notes.text.strip()
    }
    tableOutput.append(p)
print(tableOutput)
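The :has() filter can be demonstrated on a small standalone snippet (hypothetical HTML, not the live page; requires bs4 >= 4.7.1 for :has() support): rows without a qualifying anchor never enter the loop, so no per-row existence check is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration: a header row, two rows with links,
# and one row without a link.
html = """
<table>
  <tr><td>Header</td></tr>
  <tr><td><a class="asclnk" href="http://a.example/">link</a></td></tr>
  <tr><td>no link</td></tr>
  <tr><td><a class="asclnk" href="http://b.example/">link</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# :has(.asclnk[href]) keeps only <tr> elements that contain an .asclnk
# anchor carrying an href attribute.
rows = soup.select('tr:has(.asclnk[href])')
links = [row.select_one('a[href]')['href'] for row in rows]
print(links)  # ['http://a.example/', 'http://b.example/']
```

Because the header row and the link-less row are filtered out by the selector itself, there is no need for the [1:] slice or the if check used in the other answer.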