How to retry IndexError in Scrapy

Question

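Sometimes I get an IndexError because only half of the page was scraped successfully, which makes my parsing logic raise the error. How can I retry the request when an IndexError occurs?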

Ideally this would be done in a middleware, so that it can handle multiple spiders at once.

Answer 1

If you think the page needs to be reloaded when the error occurs, you could try:

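from scrapy import Request

max_retries = 5

def parse(self, response):
    # to avoid getting stuck in a loop, only retry max_retries times
    retry_count = response.meta.get('retry_count', 0)

    item = {}
    try:
        # placeholder: supply your own XPath expression and index here
        item['foo'] = response.xpath()[123]
        ...
    except IndexError as e:
        # >= rather than == so the check still stops if the count overshoots
        if retry_count >= max_retries:
            print(f'max retries reached for {response.url}: {e}')
            return
        # dont_filter=True lets the repeated URL past the duplicate filter
        yield Request(
            response.url,
            dont_filter=True,
            meta={'retry_count': retry_count + 1}
        )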

Answer 2

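In the end I used a decorator, and inside the decorator function I call the _retry() function from RetryMiddleware. It works well. It is not ideal, since a middleware that handles this would be better, but it is better than nothing.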

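import functools
import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware

def handle_exceptions(function):
    @functools.wraps(function)
    def parse_wrapper(spider, response):
        try:
            for result in function(spider, response):
                yield result
        except IndexError as e:
            # response.text is the decoded body (works on Python 2 and 3)
            logging.error("Debug HTML parsing error: %s", response.text)
            # _retry() is a private RetryMiddleware method; it returns None
            # once the retry limit has been reached
            RM = RetryMiddleware(spider.settings)
            retry_request = RM._retry(response.request, e, spider)
            if retry_request is not None:
                yield retry_request
    return parse_wrapper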

Then I use the decorator like this:

@handle_exceptions
def parse(self, response):
    ...  # parsing logic that may raise IndexError
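
The question asks for a middleware, and Answer 2 notes that one would be preferable. One possible shape is a spider middleware whose process_spider_exception hook re-schedules the request that produced the failing response. The following is a minimal sketch under stated assumptions, not a tested drop-in: the class name RetryIndexErrorMiddleware, the INDEXERROR_RETRY_TIMES setting, and the indexerror_retries meta key are all invented for illustration. Whether the hook is called for exceptions raised inside generator callbacks may depend on the Scrapy version, so it is worth verifying on yours.

```python
import logging

logger = logging.getLogger(__name__)


class RetryIndexErrorMiddleware:
    """Spider middleware sketch: retry a request whose callback raised IndexError.

    Hypothetical names: this class, the INDEXERROR_RETRY_TIMES setting and
    the 'indexerror_retries' meta key are not part of Scrapy.
    """

    def __init__(self, max_retries=5):
        self.max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        # Read the (assumed) setting, falling back to 5 retries.
        return cls(max_retries=crawler.settings.getint("INDEXERROR_RETRY_TIMES", 5))

    def process_spider_exception(self, response, exception, spider):
        # Only handle IndexError; returning None lets Scrapy and other
        # middlewares process any other exception as usual.
        if not isinstance(exception, IndexError):
            return None

        request = response.request
        retries = request.meta.get("indexerror_retries", 0)
        if retries >= self.max_retries:
            logger.warning(
                "Giving up on %s after %d IndexError retries", response.url, retries
            )
            return None

        # replace() builds a copy of the original request (same callback,
        # headers and meta); dont_filter=True bypasses the duplicate filter.
        retry_request = request.replace(dont_filter=True)
        retry_request.meta["indexerror_retries"] = retries + 1
        return [retry_request]
```

To try it, the class would be registered under SPIDER_MIDDLEWARES in settings.py, for example SPIDER_MIDDLEWARES = {"myproject.middlewares.RetryIndexErrorMiddleware": 543}, where the module path and the order value are placeholders. Because the retry counter lives in request.meta, every spider in the project gets the same behaviour without decorating each callback.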
