How to extract title inside span h5 a href link using BeautifulSoup
I am trying to extract the title of a link using BeautifulSoup. The code I am working with is as follows:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}
frame = []
for page_number in range(19):
    http = "https://www.epa.wa.gov.au/media-statements?page={}".format(page_number+1)
    print('Downloading page %s...' % http)
    url = requests.get(http, headers=hdr)
    soup = BeautifulSoup(url.content, 'html.parser')
    for row in soup.select('.view-content .views-row'):
        content = row.select_one('.views-field-body').get_text(strip=True)
        title = row.text.strip(':')
        link = 'https://www.epa.wa.gov.au' + row.a['href']
        date = row.select_one('.date-display-single').get_text(strip=True)
        frame.append({
            'title': title,
            'link': link,
            'date': date,
            'content': content
        })
dfs = pd.DataFrame(frame)
dfs.to_csv('epa_scrapper.csv', index=False, encoding='utf-8-sig')
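However, after I run the code above, nothing is displayed. How can I extract the value stored in the title attribute of the anchor tag in the link?
Also, I would like to know how to append "title", "link", "dt", and "content" to a csv file.
Thank you in advance.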
To get the link text, you can use the selector "h5 a". For example:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}
frame = []
# Pages are numbered from 1 to 19.
for page_number in range(1, 20):
    http = "https://www.epa.wa.gov.au/media-statements?page={}".format(page_number)
    print('Downloading page %s...' % http)
    url = requests.get(http, headers=hdr)
    soup = BeautifulSoup(url.content, 'html.parser')
    # Each media statement sits in its own .views-row element.
    for row in soup.select('.view-content .views-row'):
        content = row.select_one('.views-field-body').get_text(strip=True, separator='\n')
        # The title is the text of the link inside the h5 heading.
        title = row.select_one('h5 a').get_text(strip=True)
        link = 'https://www.epa.wa.gov.au' + row.a['href']
        date = row.select_one('.date-display-single').get_text(strip=True)
        frame.append({
            'title': title,
            'link': link,
            'date': date,
            'content': content
        })
# Write all collected rows to a single CSV file.
dfs = pd.DataFrame(frame)
dfs.to_csv('epa_scrapper.csv', index=False, encoding='utf-8-sig')
Running this creates epa_scrapper.csv.
[Screenshot: epa_scrapper.csv opened in LibreOffice]
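The "h5 a" selector returns the link text. The question also asks about the anchor's title attribute. Tag attributes are read with dictionary-style access or with .get(), which returns None when the attribute is absent. Here is a minimal sketch, assuming markup shaped like span > h5 > a. SAMPLE_HTML and the extract_links helper are made up for illustration and are not taken from the real page, so whether the site's anchors actually carry a title attribute needs checking against the live HTML:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<div class="view-content">
  <div class="views-row">
    <span><h5><a href="/media-statements/example-one" title="Example title one">Example link text one</a></h5></span>
  </div>
  <div class="views-row">
    <span><h5><a href="/media-statements/example-two">Example link text two</a></h5></span>
  </div>
</div>
"""

BASE_URL = "https://www.epa.wa.gov.au"


def extract_links(html, base_url=BASE_URL):
    """Return link text, title attribute, and absolute URL for each row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select(".view-content .views-row"):
        anchor = row.select_one("h5 a")
        if anchor is None:
            # Skip rows without a heading link instead of raising.
            continue
        results.append({
            "text": anchor.get_text(strip=True),
            # .get() returns None if the anchor has no title attribute.
            "title_attr": anchor.get("title"),
            # urljoin resolves a relative href against the site root.
            "link": urljoin(base_url, anchor.get("href", "")),
        })
    return results


if __name__ == "__main__":
    for item in extract_links(SAMPLE_HTML):
        print(item)
```

Checking for None before calling get_text avoids an AttributeError on rows that lack the expected element, which may be worth doing for the other select_one calls as well.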
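On the second part of the question, appending "title", "link", "date", and "content" rows to a CSV file: one possible approach is to open the file in append mode and write the header only when the file is new or empty. The sketch below uses only the standard library. The append_rows helper and the FIELDNAMES constant are hypothetical names chosen for this example:

```python
import csv
import os

# Column order for the output file (assumed from the keys used in the post).
FIELDNAMES = ["title", "link", "date", "content"]


def append_rows(path, rows, fieldnames=FIELDNAMES):
    """Append dict rows to a CSV file, writing the header only once."""
    # A missing or empty file still needs its header row.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # A new file gets a BOM via utf-8-sig so spreadsheet apps detect UTF-8;
    # later appends use plain utf-8 so no BOM is written mid-file.
    encoding = "utf-8-sig" if write_header else "utf-8"
    with open(path, "a", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    append_rows("epa_scrapper.csv", [
        {"title": "Example", "link": "https://www.epa.wa.gov.au/",
         "date": "1 January 2020", "content": "Example content"},
    ])
```

With pandas, a similar effect is available by passing mode='a' and header=False to to_csv on every write after the first.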