Scrapy: collect retry messages
As documented here, a spider has a maximum number of retries. Once that limit is reached, I get an error like the following:

Gave up retrying <GET https://foo/bar/123> (failed 3 times)

I believe the message is produced by the code here. However, I would like to do something with the give-ups. Specifically, I would like to know whether it is possible to:

- Extract the 123 part of the URL (an ID) and write those IDs to a separate file (a small sketch of the extraction follows this list).
- Access the meta information of the original request. This documentation may help.
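For the first point, the ID is simply the last path segment of the URL. A minimal sketch of pulling it out of a URL string (the URL shape is an assumption taken from the error message above):

    def extract_id(url):
        # Last path segment, e.g. https://foo/bar/123 -> '123'
        return url.rstrip('/').rsplit('/', 1)[-1]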
You can subclass scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and override _retry() to do whatever you want with the request instead of giving up.
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= self.max_retry_times:
            log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # do something with the request: inspect request.meta, look at request.url...
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
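The comment in the else: branch marks the spot where both goals from the question can be handled: the original request is still available there, so its URL and meta can be inspected before the request is dropped. A minimal sketch of such a branch (the failed_ids.txt filename and the URL-splitting logic are assumptions, not part of the original answer):

        else:
            # Both request.url and request.meta are accessible here; meta is
            # preserved across retries because each retry is made via request.copy().
            failed_id = request.url.rstrip('/').rsplit('/', 1)[-1]  # '123' for .../bar/123
            with open('failed_ids.txt', 'a') as f:  # hypothetical output file
                f.write('%s\n' % failed_id)
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)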
Then it is just a matter of referencing this custom middleware component in your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'myproject.middlewares.CustomRetryMiddleware': 500,
}
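Mapping the built-in RetryMiddleware to None disables it, so it does not run alongside the subclass, and 500 keeps the custom middleware at the same position in the downloader middleware chain that the built-in one occupies by default.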