Access the Spider self object from a custom middleware
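I'm trying to detect when something goes wrong with the pages I'm scraping. If the response does not have a valid status code, I want to write a custom value into the crawler stats so that my process can exit with a non-zero code. This is what I have written so far: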
MySpider.py
from scrapy import Spider

from spiders.utils.logging_utils import inform_user


class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        # HttpErrorMiddleware is disabled so that non-2xx responses reach parse()
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None
        }
    }

    def parse(self, response):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)
        ...
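utils/logging_utils.py
# Assumed alias: the original snippet does not show where ScrapyExceptions comes from
from scrapy import exceptions as ScrapyExceptions


def inform_user(self, level, message, close_spider=False, reason=''):
    # Normalise the level name and map it to its numeric logging level
    level = level.upper() if isinstance(level, str) else ''
    levels = {
        'CRITICAL': 50,
        'ERROR': 40,
        'WARNING': 30,
        'INFO': 20,
        'DEBUG': 10
    }
    self.logger.log(levels.get(level, 0), message)
    if close_spider:
        # Flag the job as failed in the crawler stats before aborting
        self.crawler.stats.set_value('custom/failed_job', 'True')
        raise ScrapyExceptions.UsageError(reason=reason)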
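This works as expected, but I don't think removing HttpErrorMiddleware is good practice. That is why I'm trying to write a custom middleware that sets the stats value on the crawler: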
MySpider.py
from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user


class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        super().process_spider_exception(response, exception, spider)
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            # Problem: self is the middleware here, not the spider
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)


class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None,
            CustomHttpErrorMiddleware: 50
        }
    }
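However, I am now calling the inform_user function from the middleware definition, so I no longer have access to the Spider self object, which holds the self.logger and self.crawler objects that the function uses. How can I make the Spider self object available in the middleware?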
The Spider self object is the parameter named spider in the middleware's process_spider_exception method. You can use it like this:
spider.logger.info(...)
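To make that concrete, here is a minimal sketch of how the helper and the middleware could look once the spider is passed explicitly. It is only an illustration under a few assumptions: FailedJobStatsMiddleware and the mark_failed parameter are hypothetical names that are not in the question, the middleware is written as a plain class rather than a subclass of HttpErrorMiddleware, and raising an exception to stop the crawl is left out. It relies only on spider.logger and spider.crawler.stats.set_value, which the question's code already uses.

```python
import logging

# Map level names to the numeric levels from the standard logging module.
LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def inform_user(spider, level, message, mark_failed=False):
    """Log through the spider's logger and optionally flag the job as failed.

    `spider` is whatever object exposes `.logger` and `.crawler.stats`;
    inside a spider callback that is `self`, inside a middleware hook it is
    the `spider` argument.
    """
    level = level.upper() if isinstance(level, str) else ""
    spider.logger.log(LEVELS.get(level, logging.NOTSET), message)
    if mark_failed:
        spider.crawler.stats.set_value("custom/failed_job", "True")


class FailedJobStatsMiddleware:
    """Hypothetical spider middleware that records failed responses in the stats."""

    def process_spider_exception(self, response, exception, spider):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            # Pass the spider argument, not the middleware's own self.
            inform_user(spider, "ERROR", message, mark_failed=True)
        # Returning None lets the remaining middlewares handle the exception.
        return None
```

With something like this in place, the process that runs the crawl could read the value back after the crawl finishes, for example with crawler.stats.get_value('custom/failed_job'), and choose its exit code from it.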