Get the date from span tags
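Using Beautiful Soup, I want to extract dates from the pages listed in a text file of URLs. On each page the date is defined in a span tag inside a div with class="update". When I run the code below, the result I get is <span id="time"></span>, but not the actual time. Please help. For example, the links in sabah_url.txt look like "http://www.dailysabah.com/world/2012/02/20/seeking-international-support-to-block-assad".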
from cookielib import CookieJar
import urllib2
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
try:
    url_file = open('sabah_url.txt', 'r')
    for line in url_file:
        print line
        # Open each extracted URL with the urllib2 library
        data = urllib2.urlopen(line).read()
        soup = BeautifulSoup(data)
        # Extract the date spans from each page by their id
        date = soup.find_all('span', {'id': 'time'})
        for item in date:
            print item
except BaseException, e:
    print 'failed', str(e)
    pass
Assuming you want the publication date, you can read it from the meta tag:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.dailysabah.com/world/2012/02/20/seeking-international-support-to-block-assad"
data = urllib2.urlopen(url)
soup = BeautifulSoup(data)
print soup.find('meta', itemprop='datePublished', content=True)['content']
This prints 2012-02-20T17:41:01Z.
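To run the same lookup over every URL in sabah_url.txt, it may help to wrap the extraction in a helper that returns None when a page lacks the tag. The following is a minimal sketch in Python 3, assuming each page carries the same meta tag; the helper name `published_date`, the explicit "html.parser" backend, and the inline sample markup are choices made for this illustration and are not part of the code above.

```python
from bs4 import BeautifulSoup


def published_date(html):
    """Return the content of <meta itemprop="datePublished">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", itemprop="datePublished", content=True)
    return tag["content"] if tag is not None else None


# Hypothetical markup standing in for a downloaded article page.
sample = (
    '<html><head>'
    '<meta itemprop="datePublished" content="2012-02-20T17:41:01Z">'
    '</head><body><span id="time"></span></body></html>'
)

print(published_date(sample))  # prints 2012-02-20T17:41:01Z
```

For real pages, the `html` argument would be the downloaded page body for each line of the file. Each line read from the file ends with a newline, so it is worth calling `strip()` on it before using it as a URL.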
To make it look like "February 20, 2012", you can use the python-dateutil module:
>>> from dateutil import parser
>>> s = "2012-02-20T17:41:01Z"
>>> parser.parse(s)
datetime.datetime(2012, 2, 20, 17, 41, 1, tzinfo=tzutc())
>>> parser.parse(s).strftime('%B %d, %Y')
'February 20, 2012'
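If installing python-dateutil is not an option, the standard library can handle this particular format too. This is a sketch in Python 3 that swaps in `datetime.strptime` for `dateutil.parser`; it assumes every timestamp has exactly the shape shown above (UTC with a trailing "Z"), and the function name `format_published` is made up for the example.

```python
from datetime import datetime, timezone


def format_published(stamp):
    """Turn '2012-02-20T17:41:01Z' into 'February 20, 2012'."""
    # The trailing 'Z' is matched literally, so mark the result as UTC by hand.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.strftime("%B %d, %Y")


print(format_published("2012-02-20T17:41:01Z"))  # prints February 20, 2012
```

Unlike `parser.parse`, this raises ValueError for any timestamp that deviates from the fixed pattern, so dateutil remains the more forgiving choice when the format varies.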