Access the Spider self object in a custom middleware

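I am trying to detect when something goes wrong with the page I am scraping. If the response does not have a valid status code, I want to write a custom value to the crawler stats so that my process can return a non-zero exit code. This is what I have written so far: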

MySpider.py

from scrapy import Spider

from spiders.utils.logging_utils import inform_user

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None
        }
    }

    def parse(self, response):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)
        ...

utils/logging_utils.py

from scrapy.exceptions import CloseSpider


def inform_user(self, level, message, close_spider=False, reason=''):
    # `self` is the running Spider instance, passed in explicitly by the caller.
    level = level.upper() if isinstance(level, str) else ''
    levels = {
        'CRITICAL': 50,
        'ERROR': 40,
        'WARNING': 30,
        'INFO': 20,
        'DEBUG': 10
    }
    self.logger.log(levels.get(level, 0), message)
    if close_spider:
        # Flag the failure in the crawler stats, then ask Scrapy to stop the spider.
        self.crawler.stats.set_value('custom/failed_job', 'True')
        raise CloseSpider(reason=reason)

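This works as expected, but I don't think disabling HttpErrorMiddleware is good practice. That is why I am trying to write a custom middleware that sets the stat on the crawler: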

MySpider.py

from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user

class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        super().process_spider_exception(response, exception, spider)

        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None,
            CustomHttpErrorMiddleware: 50
        }
    }

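However, I am now calling the inform_user function from the middleware definition, so I no longer have access to the Spider self object, which holds the self.logger and self.crawler objects that the function uses. How can I make the Spider self object available in the middleware?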

The Spider self object is the argument named spider in the middleware's process_spider_exception method. You can use it like this: spider.logger.info(...)
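
As a rough sketch of that idea (not tested against a live crawl), the helper can take the spider as an ordinary argument and the middleware can simply forward the spider it receives. The class name StatusStatsMiddleware and the failed parameter are made up for illustration, and the middleware is written as a plain class here; it could equally subclass HttpErrorMiddleware as in the question. The sketch only logs and sets the stat, it does not try to close the spider.

```python
import logging

# Level names accepted by inform_user, mapped to the stdlib logging levels.
LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def inform_user(spider, level, message, failed=False):
    """Log through the spider's logger and optionally flag the job as failed."""
    levelno = LEVELS.get(str(level).upper(), logging.NOTSET)
    spider.logger.log(levelno, message)
    if failed:
        # Same stat key as in the question.
        spider.crawler.stats.set_value("custom/failed_job", "True")


class StatusStatsMiddleware:
    """Hypothetical spider middleware: the running spider arrives as `spider`."""

    def process_spider_exception(self, response, exception, spider):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            # Pass the spider argument, not the middleware's own `self`.
            inform_user(spider, "ERROR", message, failed=True)
        # Returning None lets Scrapy continue handling the exception as usual.
        return None
```

The middleware would still have to be registered under SPIDER_MIDDLEWARES, as in the question. Whether it actually sees the HTTP error may depend on its order relative to the built-in HttpErrorMiddleware, so treat the priority value as something to experiment with.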