python scrapy extract data from website

I want to scrape data from this page. Here is my current code:

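import cStringIO

import pycurl
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

buf = cStringIO.StringIO()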
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()

response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()

It works, but I need the title, the video link, and the description as separate variables. How can I do that?

The title can be extracted with //title/text(), and the video source link with //video/source/@src:

import re

selector = Selector(response=response)

title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small\.jpg$', code_url).group(1)

print title
print description
print video_sources
print code

This prints:

Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8
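
The code-extraction step above can be tried in isolation. The following is a minimal sketch in Python 3 (an assumption; the answer's code is Python 2). The image URL is hypothetical, made up to match the `-play-small.jpg` pattern, since the real content of the EdImage meta tag is not shown here. The helper name `extract_code` is also only illustrative. Unlike the one-liner above, it returns None instead of raising when the pattern is absent.

```python
import re

# Matches "<code>-play-small.jpg" at the end of a URL; the dot is escaped
# so that it only matches a literal ".".
CODE_PATTERN = re.compile(r'(\w+)-play-small\.jpg$')


def extract_code(image_url):
    """Return the video code embedded in the image URL, or None if absent."""
    match = CODE_PATTERN.search(image_url)
    return match.group(1) if match else None


# Hypothetical URL, for illustration only.
sample_url = "http://static.example.org/video_image/L49VXZwfup8-play-small.jpg"
code = extract_code(sample_url)
print(code)
```

Checking for None before using the result avoids an AttributeError on pages where the meta tag has a different shape.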

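You don't need scrapy to fetch a single URL: just use a simpler tool (even the simplest, urllib.urlopen(theurl).read()!) to get the HTML of the single page, then parse that HTML with e.g. BeautifulSoup. From a simple "view source", it looks like you're after: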

<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>

(the title), one of the three:

<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>

(the video linkS, plural; I can't pick one for you because you haven't told us which format you prefer!-), and

<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />

(the description). BeautifulSoup makes getting each of these very simple, e.g., after the needed imports:

html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html)
title = soup.find('title').text

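and so on (but you will have to pick one video link; I see in their source that they're called "pre-rolls", so the links to the actual non-ad videos may not be on the page at all, and may only be accessible after a log-in or something).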
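
For completeness, here is a rough sketch of the same idea with only the Python 3 standard library. It uses html.parser instead of BeautifulSoup, which is a swapped-in technique and not what the answer uses. The sample markup is copied from the tags quoted above; the enclosing `<video>` element, the names `PageInfoParser`, `parse_page`, and `fetch_html`, and the default choice of video/mp4 are all assumptions made for illustration.

```python
import urllib.request
from html.parser import HTMLParser

# Sample markup assembled from the tags quoted above (the <video> wrapper is assumed).
SAMPLE_HTML = """
<html><head>
<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />
</head><body>
<video>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>
</video>
</body></html>
"""


class PageInfoParser(HTMLParser):
    """Collect the title, the meta description, and <source> URLs by MIME type."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.sources = {}  # MIME type -> URL
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "source" and "src" in attrs:
            self.sources[attrs.get("type")] = attrs["src"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url, timeout=15):
    """Python 3 counterpart of urllib.urlopen(url).read(), decoded to text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_page(html, preferred_type="video/mp4"):
    """Return (title, video URL for preferred_type or None, description)."""
    parser = PageInfoParser()
    parser.feed(html)
    return parser.title.strip(), parser.sources.get(preferred_type), parser.description


title, video_link, description = parse_page(SAMPLE_HTML)
print(title)
print(video_link)
print(description)
```

To run it against the live page, pass the result of fetch_html('http://www.guardalo.org/99407/') to parse_page instead of SAMPLE_HTML; the page may have changed since, so the tags above may no longer be present.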