Logging to a file using Scrapy and Crochet libraries
I'm running Scrapy from a script, using the Crochet library to make the crawl a blocking call. Now I'm trying to dump the logs to a file, but for some reason they are being redirected to STDOUT instead. I suspect the Crochet library, but so far I have no clue.
- How do I debug this kind of issue? Please share your debugging tips.
- How do I fix it so that the logs are dumped to the file?
import logging

import crochet
import scrapy
from scrapy import crawler
from scrapy.utils import log

crochet.setup()


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            logging.info({'title': title.css('a ::text').get()})


@crochet.wait_for(timeout=None)
def crawl():
    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(BlogSpider)
    return deferred


log.configure_logging(settings={'LOG_FILE': 'my.log'})
logging.info("Starting...")
crawl()
I see that you are calling logging.info while having configured Scrapy's log settings; this sends the log messages to Python's root logger rather than Scrapy's**. Try self.logger.info("whatever") inside the spider instance, since Scrapy initializes a logger instance on each spider object.
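A minimal sketch of that first suggestion, reusing the BlogSpider from the question (only the logging call changes; the message text is purely illustrative):

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            # self.logger is the per-spider logger that Scrapy creates for
            # every spider instance, instead of the module-level root logger.
            self.logger.info("title: %s", title.css('a ::text').get())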
Alternatively, set up a log handler on the root logger yourself:
# optional log formatting
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

file_handler = logging.FileHandler('my.log', mode='w')
file_handler.setFormatter(formatter)
file_handler.name = 'file_level_handler'  # optional

root_logger = logging.getLogger()
root_logger.addHandler(file_handler)
**I have read that Scrapy configures (Python's) root logger, so I may be wrong here, but apparently your settings are not being applied to Python's root logger.
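As for debugging this kind of issue, one quick check is to inspect the handlers attached to Python's root logger before and after the logging setup; that shows whether a StreamHandler pointing at STDOUT got installed instead of (or in addition to) a FileHandler. A small sketch, using only the standard library plus the same configure_logging call as in the script above (dump_root_handlers is just an illustrative helper, not part of Scrapy):

import logging

from scrapy.utils import log  # same import as in the script above


def dump_root_handlers(label):
    # List every handler currently attached to Python's root logger, so you
    # can see whether log records end up in a file or on a stream (STDOUT).
    root = logging.getLogger()
    print(f"--- root logger handlers {label} ---")
    for handler in root.handlers:
        target = getattr(handler, 'baseFilename', None) or getattr(handler, 'stream', None)
        print(type(handler).__name__, target)


dump_root_handlers("before configure_logging")
log.configure_logging(settings={'LOG_FILE': 'my.log'})
dump_root_handlers("after configure_logging")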
The only thing that needed to change was to pass the log settings to CrawlerRunner as well.
import logging

import crochet
import scrapy
from scrapy import crawler
from scrapy.utils import log

crochet.setup()


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            logging.info({'title': title.css('a ::text').get()})


@crochet.wait_for(timeout=None)
def crawl():
    runner = crawler.CrawlerRunner(settings=log_settings)
    deferred = runner.crawl(BlogSpider)
    return deferred


log_settings = {'LOG_FILE': 'my.log'}
log.configure_logging(settings=log_settings)
logging.info("Starting...")
crawl()
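If you prefer an explicit Settings object over a plain dict, an equivalent sketch (same LOG_FILE value; the rest of the script stays as above):

from scrapy.settings import Settings

log_settings = Settings()
log_settings.set('LOG_FILE', 'my.log')

# CrawlerRunner accepts either a plain dict or a Settings instance.
runner = crawler.CrawlerRunner(settings=log_settings)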