How to mark scrape failed because of 503 as error in Scrapy?
I got a 503 status while scraping. The request was retried, but then it was ignored. I want it to be marked as an error instead of being ignored. How can I do that?
I would prefer to set this in settings.py so it applies to all my spiders; handle_httpstatus_list only seems to affect a single spider.
There are two settings you should look at:
RETRY_HTTP_CODES:
Default: [500, 502, 503, 504, 408]
Which HTTP response codes to retry. Other errors (DNS lookup issues, connections lost, etc) are always retried.
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#retry-http-codes
and HTTPERROR_ALLOWED_CODES:
Default: []
Pass all responses with non-200 status codes contained in this list.
https://doc.scrapy.org/en/latest/topics/spider-middleware.html#std:setting-HTTPERROR_ALLOWED_CODES
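For instance, a minimal settings.py sketch combining the two (the values shown are only illustrative, not a recommendation):

# settings.py
# Codes to retry before giving up (503 is already in the default list).
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
RETRY_TIMES = 2  # retries on top of the first attempt

# Let responses with these status codes reach the spider callbacks
# instead of being dropped by HttpErrorMiddleware.
HTTPERROR_ALLOWED_CODES = [503]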
Finally, I overrode the retry middleware to make one small change: whenever the spider gives up retrying a request, regardless of the status code, it is logged as an error. Scrapy does not seem to treat giving up on retries as an error, which is odd to me.
Here is the middleware in case anyone wants to use it. Don't forget to activate it in settings.py.
import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)


class Retry500Middleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # This is the only change: the stock middleware uses `logger.debug` here,
            # so giving up on a request never surfaces as an error.
            logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
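As a sketch, activating it in settings.py could look like the following; the module path myproject.middlewares is a placeholder for wherever you put the class, and the built-in RetryMiddleware is disabled so the two don't run side by side:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the stock retry middleware...
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # ...and register the custom one in its default slot (550).
    'myproject.middlewares.Retry500Middleware': 550,
}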