Parsing a date with BeautifulSoup
I am using BeautifulSoup to get information from a page, and I obtained this link:
[<span class="field-content">Friday, September 11, 2015</span>]
using the commands
links = soup.find_all('div', attrs={'class':'views-row'})
link = links[0]
link.find('span', attrs={'class':'views-field views-field-created'}).select('span')
But I need to parse the date. How can I get Friday, September 11, 2015 out of it?
Found it: link.find('span', attrs={'class':'views-field views-field-created'}).select_one('span').text
To answer the example in the question, select the last element from the result set:
link.find('span', attrs={'class':'views-field views-field-created'}).select('span')[-1].text
Or shorter:
link.find_all("span")[-1].text
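The variants above can be checked offline. This is a minimal sketch, assuming a row shaped like the snippet in the question; the `html` string is hypothetical markup written for illustration, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a single row, modelled on the span shown in the question.
html = """
<div class="views-row">
  <span class="views-field views-field-title">
    <span class="field-content"><a href="/party-pictures/2015/kicks-offs-sing-offs-and-pro-ams">Kicks offs, sing offs, and pro ams</a></span>
  </span>
  <span class="views-field views-field-created">
    <span class="field-content">Friday, September 11, 2015</span>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find_all('div', attrs={'class': 'views-row'})[0]

# The outer span that wraps the date; the class string is matched as a whole.
created = link.find('span', attrs={'class': 'views-field views-field-created'})

via_select_one = created.select_one('span').text   # first nested span
via_last_item = created.select('span')[-1].text    # last item of the result list
via_find_all = link.find_all("span")[-1].text      # last span anywhere in the row

print(via_select_one)
```

All three expressions return the same string here because the date span is the last span in the row. If the real markup places other spans after the date, only the first two remain reliable.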
But if you want to extract all the information and store it as structured data, using stripped_strings would be a better approach.
Example
import requests
from bs4 import BeautifulSoup

url = 'https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

data = []
for item in soup.select('.view-content div'):
    # All non-empty text fragments of the row, with whitespace stripped
    c = list(item.stripped_strings)
    data.append({
        'title': c[0],
        'date': c[-1],
        # Drop the '/web/<timestamp>/' archive prefix to keep the original URL
        'url': item.a['href'].split('/', 3)[-1]
    })

print(data)
Output
[{'title': 'Kicks offs, sing offs, and pro ams', 'date': 'Friday, September 11, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/kicks-offs-sing-offs-and-pro-ams'}, {'title': 'Grand Finale of the Hampton Classic Horse Show', 'date': 'Tuesday, September 1, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/grand-finale-of-the-hampton-classic-horse-show'}, {'title': 'Riders, Spectators, Horses, and More ...', 'date': 'Wednesday, August 26, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/riders-spectators-horses-and-more'}, {'title': 'Artist and Writers (and Designers)', 'date': 'Thursday, August 20, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/artist-and-writers-and-designers'}, {'title': 'Garden Parties Kickoffs and Summer Benefits', 'date': 'Monday, August 17, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/garden-parties-kickoffs-and-summer-benefits'}, {'title': 'The Summer Set', 'date': 'Wednesday, August 12, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/the-summer-set'}, {'title': 'Midsummer Parties', 'date': 'Wednesday, August 5, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/midsummer-parties'}, {'title': 'The Watermill Center and The Parrish', 'date': 'Wednesday, July 29, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/the-watermill-center-and-the-parrish'}, {'title': 'Unconditional Love', 'date': 'Thursday, July 23, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/unconditional-love'}, {'title': "Women's Health, Boys & Girls, Cancer Research, and Just Plain Summer Fun", 'date': 'Friday, July 17, 2015', 'url': 'http://www.newyorksocialdiary.com/party-pictures/2015/womens-health-boys-girls-cancer-research-and-just-plain-summer-fun'},...]
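If the date is needed as a real date object rather than text, the extracted string can be handed to `datetime.strptime`. This is a minimal sketch, assuming every date follows the "Friday, September 11, 2015" pattern and that the default (English) locale is in effect. The names `DATE_FORMAT`, `parse_date` and `parsed_date` are introduced here for illustration only, and `records` is a shortened copy of the output above.

```python
from datetime import date, datetime

# Assumed layout of the scraped strings, e.g. "Friday, September 11, 2015".
DATE_FORMAT = "%A, %B %d, %Y"


def parse_date(text: str) -> date:
    """Convert a scraped date string into a datetime.date."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()


# Two records copied from the output above, without the 'url' key.
records = [
    {'title': 'Kicks offs, sing offs, and pro ams', 'date': 'Friday, September 11, 2015'},
    {'title': 'Grand Finale of the Hampton Classic Horse Show', 'date': 'Tuesday, September 1, 2015'},
]

for record in records:
    record['parsed_date'] = parse_date(record['date'])

print(records[0]['parsed_date'].isoformat())  # 2015-09-11
```

`strptime` raises `ValueError` for a string that does not match the format, so a row with an unexpected layout fails loudly instead of being stored with a wrong date.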