What is the impact of raising CloseSpider in Scrapy?
I would like to know what the effect of raising CloseSpider is. There is no information about it in the documentation: http://doc.scrapy.org/en/latest/topics/exceptions.html#closespider. As you know, Scrapy processes several requests concurrently. What happens if this exception is raised before the last request has been processed? Will Scrapy wait for the rest of the requests that were generated earlier to be processed?

Example:
from scrapy import Request
from scrapy.exceptions import CloseSpider

def parse(self, response):
    for i in range(1, 100):
        my_url = 'http://someurl.com/item/' + str(i)
        if i == 50:
            # stop the whole crawl once item 50 is reached
            raise CloseSpider('')
        else:
            yield Request(url=my_url, callback=self.my_handler)

def my_handler(self, response):
    # handle the response here
    pass
Thank you for your answers.
========================
Possible solution:
is_alive = True  # class-level flag, read and written as self.is_alive below

def parse(self, response):
    for i in range(1, 100):
        if not self.is_alive:
            break
        my_url = 'http://url.com/item/' + str(i)
        yield Request(url=my_url, callback=self.my_handler)

def my_handler(self, response):
    if ...:  # the response does not contain a new item
        self.is_alive = False
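Assembled into a complete spider, this flag-based workaround might look roughly like the sketch below. The spider name, domain, and the CSS check for "no new item" are placeholder assumptions for illustration, not part of the original question. Note that the flag only stops parse() from yielding further requests; requests that were already yielded and scheduled will still be downloaded and handled.

    import scrapy
    from scrapy import Request

    class ItemSpider(scrapy.Spider):
        # Hypothetical spider name and start URL, for illustration only.
        name = 'items'
        start_urls = ['http://url.com/item/1']
        is_alive = True

        def parse(self, response):
            for i in range(2, 100):
                if not self.is_alive:
                    break  # stop scheduling new requests once the flag is cleared
                yield Request('http://url.com/item/' + str(i),
                              callback=self.my_handler)

        def my_handler(self, response):
            # Placeholder check: clear the flag when a page yields nothing new.
            if not response.css('.item'):
                self.is_alive = False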
According to the source code, when a CloseSpider exception is raised, the engine.close_spider() method is executed:
def handle_spider_error(self, _failure, request, response, spider):
    exc = _failure.value
    if isinstance(exc, CloseSpider):
        self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
        return
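To see this path in practice, a minimal spider along the following lines could raise CloseSpider from a callback. The spider name, URLs, and stop condition are assumptions for illustration, not taken from the question. When the exception is raised, handle_spider_error() above catches it and calls engine.close_spider() with the given reason (or 'cancelled' if the reason is empty).

    import scrapy
    from scrapy.exceptions import CloseSpider

    class StopEarlySpider(scrapy.Spider):
        # Hypothetical spider, for illustration only.
        name = 'stop_early'
        start_urls = ['http://someurl.com/item/' + str(i) for i in range(1, 100)]

        def parse(self, response):
            if b'out of stock' in response.body:
                # Caught by handle_spider_error(), which then calls
                # engine.close_spider(spider, 'no_more_items').
                raise CloseSpider('no_more_items')
            yield {'url': response.url}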
engine.close_spider() itself closes the spider and clears all of its outstanding requests:
def close_spider(self, spider, reason='cancelled'):
    """Close (cancel) spider and clear all its outstanding requests"""
    slot = self.slot
    if slot.closing:
        return slot.closing
    logger.info("Closing spider (%(reason)s)",
                {'reason': reason},
                extra={'spider': spider})
    dfd = slot.close()
    # ...
It also schedules close_spider() calls on the different components of the Scrapy architecture: the downloader, the scraper, the scheduler, and so on.
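One way to observe which reason the spider was closed with is the Spider.closed() shortcut for the spider_closed signal. A small sketch, with the spider details assumed rather than taken from the answer:

    import scrapy

    class ClosingAwareSpider(scrapy.Spider):
        # Hypothetical spider, for illustration only.
        name = 'closing_aware'
        start_urls = ['http://someurl.com/item/1']

        def parse(self, response):
            yield {'url': response.url}

        def closed(self, reason):
            # 'reason' is the string passed to CloseSpider / close_spider,
            # e.g. 'no_more_items' above, or 'finished' for a normal shutdown.
            self.logger.info('Spider closed, reason: %s', reason)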