Scrapy crawler: caught exception reading instance data
I'm new to Python and want to build a web crawler with Scrapy. I worked through the tutorial at http://blog.siliconstraits.vn/building-web-crawler-scrapy/. The spider code looks like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from nettuts.items import NettutsItem
from scrapy.http import Request

class MySpider(BaseSpider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://net.tutsplus.com/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h1[@class="post_title"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item
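For reference, the NettutsItem imported above comes from the tutorial project's items.py. A minimal sketch of what it likely contains (the single "title" field is an assumption based on how the spider uses it):

```python
# items.py -- a minimal sketch of the tutorial's NettutsItem;
# the single "title" field is an assumption inferred from the spider.
from scrapy.item import Item, Field

class NettutsItem(Item):
    title = Field()
```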
Starting the crawler from the command line with `scrapy crawl nettuts` produces the following error:
[boto] DEBUG: Retrieving credentials from metadata server.
2015-07-05 18:27:17 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/anaconda/lib/python2.7/site-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/anaconda/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/anaconda/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/anaconda/lib/python2.7/urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/anaconda/lib/python2.7/urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 65] No route to host>
2015-07-05 18:27:17 [boto] ERROR: Unable to read instance data, giving up
I really don't know what's going on. I hope someone can help.
The key message is:
URLError: <urlopen error [Errno 65] No route to host>
This is telling you that your computer does not know how to communicate with the site it is trying to reach. Can you access the site normally (i.e., in a web browser) from the machine where you are running this Python code?
In your settings.py file, add the following setting:
DOWNLOAD_HANDLERS = {'s3': None,}
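Some context on why this works: with S3 support enabled, Scrapy's boto integration probes the EC2 instance metadata server at startup; outside AWS that address is unreachable, which is what produces the "No route to host" error, and disabling the `s3` handler skips the probe entirely. A minimal settings.py sketch (the BOT_NAME and module paths assume the tutorial's project is named "nettuts"; adjust to your project):

```python
# settings.py -- a minimal sketch; BOT_NAME and the module paths assume
# the tutorial's project is named "nettuts" (adjust to your project).
BOT_NAME = 'nettuts'
SPIDER_MODULES = ['nettuts.spiders']
NEWSPIDER_MODULE = 'nettuts.spiders'

# Disable the S3 download handler: with it enabled, Scrapy's boto
# integration probes the EC2 metadata server at startup, which is
# unreachable outside AWS and triggers the "No route to host" error.
DOWNLOAD_HANDLERS = {'s3': None,}
```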