Multiple Spiders in CrawlerProcess - How to get a log per spider?
Scenario:
- A single Scrapy project with multiple spiders.
- All spiders are run together from one script (see the runner sketch below).
Problem:
- All log messages end up in the same namespace, so it is impossible to tell which message belongs to which spider.
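For context, this is a minimal sketch of the kind of runner script meant here, using CrawlerProcess; the project and spider names (myproject, Spider1, Spider2, Spider3) are placeholders, not part of the original question:

    # Minimal runner sketch (assumed layout: Spider1..Spider3 defined in myproject.spiders).
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.spider1 import Spider1
    from myproject.spiders.spider2 import Spider2
    from myproject.spiders.spider3 import Spider3

    process = CrawlerProcess(get_project_settings())
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.crawl(Spider3)
    process.start()  # blocks here until all spiders are finished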
In Scrapy 0.24, running multiple spiders from a single script, I got one log file in which each message was tagged with its spider, similar to:
2015-09-30 22:55:12-0400 [scrapy] INFO: Scrapy 0.24.5 started (bot: mybot)
2015-09-30 22:55:12-0400 [scrapy] DEBUG: Enabled extensions: LogStats, ...
2015-09-30 21:55:12-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, ...
2015-09-30 21:55:12-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, ...
2015-09-30 21:55:12-0500 [scrapy] DEBUG: Enabled item pipelines: MybotPipeline
2015-09-30 21:55:12-0500 [spider1] INFO: Spider opened
2015-09-30 21:55:12-0500 [spider1] INFO: Crawled 0 pages ...
2015-09-30 21:55:12-0500 [spider2] INFO: Spider opened
2015-09-30 21:55:12-0500 [spider2] INFO: Crawled 0 pages ...
2015-09-30 21:55:12-0500 [spider3] INFO: Spider opened
2015-09-30 21:55:12-0500 [spider3] INFO: Crawled 0 pages ...
2015-09-30 21:55:13-0500 [spider2] DEBUG: Crawled (200) <GET ...
2015-09-30 21:55:13-0500 [spider3] DEBUG: Crawled (200) <GET ...
2015-09-30 21:55:13-0500 [spider1] DEBUG: Crawled (200) <GET ...
2015-09-30 21:55:13-0500 [spider1] INFO: Closing spider (finished)
2015-09-30 21:55:13-0500 [spider1] INFO: Dumping Scrapy stats: ...
2015-09-30 21:55:13-0500 [spider3] INFO: Closing spider (finished)
2015-09-30 21:55:13-0500 [spider3] INFO: Dumping Scrapy stats: ...
2015-09-30 21:55:13-0500 [spider2] INFO: Closing spider (finished)
2015-09-30 21:55:13-0500 [spider2] INFO: Dumping Scrapy stats: ...
With this log file, I could run grep spiderX logfile.txt whenever I needed the logs of one specific spider. But now, in Scrapy 1.0, I get:
2015-09-30 21:55:12-0500 [scrapy] INFO: Spider opened
2015-09-30 21:55:12-0500 [scrapy] INFO: Crawled 0 pages ...
2015-09-30 21:55:12-0500 [scrapy] INFO: Spider opened
2015-09-30 21:55:12-0500 [scrapy] INFO: Crawled 0 pages ...
2015-09-30 21:55:12-0500 [scrapy] INFO: Spider opened
2015-09-30 21:55:12-0500 [scrapy] INFO: Crawled 0 pages ...
Obviously it is impossible to tell which spider each message belongs to.
The question: is there any way to get the previous behavior back?
Another option would be to create a different log file for each spider. [1]
But it is not possible to override the log file from within a spider using custom_settings. [2]
So, is there any way to have a different log file per spider?
[1] Scrapy Project with Multiple Spiders - Custom Settings Ignored
[2] https://github.com/scrapy/scrapy/issues/1612
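For reference, this is roughly the attempt that does not work (a hedged sketch based on issue [2] above): a LOG_FILE set through custom_settings is ignored, because logging is already configured before the per-spider settings are applied when the spiders are started from a script.

    import scrapy

    class Spider1(scrapy.Spider):
        name = 'spider1'
        # The attempt that fails: LOG_FILE set here has no effect when the
        # spider is started from a script (see scrapy issue #1612 above).
        custom_settings = {
            'LOG_FILE': 'spider1.log',
        }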
I just found out this is a known "bug": https://github.com/scrapy/scrapy/issues/1576
The known workaround: change utils.log.TopLevelFormatter.filter to
def filter(self, record):
    if hasattr(record, 'spider'):
        record.name = record.spider.name
    elif any(record.name.startswith(l + '.') for l in self.loggers):
        record.name = record.name.split('.', 1)[0]
    return True
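One way to apply this without editing Scrapy's own source (a hedged sketch, not part of the linked issue): monkey-patch the method onto TopLevelFormatter from the runner script before the crawl starts. Since the handler looks the method up on the class, the patch also affects the filter instance installed by configure_logging().

    from scrapy.utils.log import TopLevelFormatter

    def _patched_filter(self, record):
        # Patched version: use the spider's name as the logger name when available.
        if hasattr(record, 'spider'):
            record.name = record.spider.name
        elif any(record.name.startswith(l + '.') for l in self.loggers):
            record.name = record.name.split('.', 1)[0]
        return True

    TopLevelFormatter.filter = _patched_filter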
@Djunzu's answer was not easy enough to apply directly, so I worked on polishing it.
# -*- coding: utf-8 -*-
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging, _get_handler, TopLevelFormatter
import datetime
import logging
import time

class MyTopLevelFormatter(TopLevelFormatter):
    def __init__(self, loggers=None, name=None):
        super(MyTopLevelFormatter, self).__init__()
        self.loggers = loggers or []
        self.name = name

    def filter(self, record):
        # Drop records whose logger name already contains this spider's name
        # (they have been handled already).
        if self.name in record.name:
            return False
        if hasattr(record, 'spider'):
            # Keep only records emitted for this spider, prefixed with its name.
            if record.spider.name != self.name:
                return False
            record.name = record.spider.name + "." + record.name
        elif hasattr(record, 'crawler') and hasattr(record.crawler, 'spidercls'):
            # Records that carry a crawler (but no spider yet) are matched by spider class.
            if record.crawler.spidercls.name != self.name:
                return False
            record.name = record.crawler.spidercls.name + "." + record.name
        elif any(record.name.startswith(l + '.') for l in self.loggers):
            record.name = record.name.split('.', 1)[0]
        return True

def log_init(name):
    now = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d-%H-%M-%S')
    configure_logging({'LOG_FILE': "../logs/{0}_{1}.log".format(name, now)}, install_root_handler=False)
    settings = get_project_settings()
    settings['LOG_FILE'] = "../logs/{0}_{1}.log".format(name, now)
    settings['DISABLE_TOPLEVELFORMATTER'] = True
    # Build a file handler for this spider and attach the per-spider filter to it.
    handler = _get_handler(settings)
    handler.addFilter(MyTopLevelFormatter(loggers=[__name__], name=name))
    # handler.addFilter(TopLevelFormatter(loggers=[__name__]))
    logging.root.addHandler(handler)
Then, in your spider, do this:
class MySpider(scrapy.Spider):
    # ... (rest of the spider omitted)
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        log_init(self.name)
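With that in place, each spider should write to its own timestamped file under ../logs/ (the path pattern comes from log_init above). A purely illustrative check after a run, assuming the spider names from the question:

    # Hypothetical sanity check: one log file per spider is expected.
    import glob
    print(glob.glob('../logs/spider1_*.log'))  # e.g. ['../logs/spider1_2015-09-30-21-55-12.log']
    print(glob.glob('../logs/spider2_*.log'))
    print(glob.glob('../logs/spider3_*.log'))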