Logging to a file using Scrapy and Crochet libraries

I'm running Scrapy from a script, using the Crochet library to make the code blocking. Now I'm trying to dump the logs to a file, but for some reason they started being redirected to STDOUT instead. I suspect the Crochet library is involved, but so far I have no clue.

  1. How can I debug this kind of problem? Please share your debugging tips.
  2. How can I fix it so the logs are dumped to the file?
import logging

import crochet
import scrapy
from scrapy import crawler
from scrapy.utils import log

crochet.setup()

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            logging.info({'title': title.css('a ::text').get()})

@crochet.wait_for(timeout=None)
def crawl():
    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(BlogSpider)
    return deferred

log.configure_logging(settings={'LOG_FILE': 'my.log'})
logging.info("Starting...")
crawl()

I see that you set Scrapy's log settings but log with logging.info, which sends the log messages to Python's root logger rather than to Scrapy's root logger**. Try using self.logger.info("whatever") inside the spider instance, since Scrapy initializes a logger instance on each spider object. Or set up a log handler for the root logger yourself:

# optional log formatting
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
file_handler = logging.FileHandler('my.log', mode='w')
file_handler.setFormatter(formatter)
file_handler.name = 'file_level_handler'  # optional
root_logger = logging.getLogger()
root_logger.addHandler(file_handler)
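
As for the debugging question: a quick way to see where log records actually go is to print the handlers attached to the root logger and to Scrapy's logger before and after calling configure_logging. A minimal sketch using only the standard logging module (the helper name dump_handlers is just for illustration):

import logging

def dump_handlers(name=None):
    # name=None returns the root logger
    logger = logging.getLogger(name)
    print('handlers on %r:' % logger.name)
    for handler in logger.handlers:
        # FileHandler exposes baseFilename; StreamHandler writes to a stream
        target = getattr(handler, 'baseFilename', getattr(handler, 'stream', None))
        print('  %s -> %r' % (type(handler).__name__, target))

dump_handlers()          # root logger
dump_handlers('scrapy')  # Scrapy's logger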

**I read that Scrapy sets (Python's) root logger, so I may be wrong, but apparently your settings weren't being applied to Python's root logger.
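
For completeness, the self.logger suggestion above would look like this inside the spider; it is the same parse method as in the question, just switched to Scrapy's per-spider logger:

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            # self.logger is created by Scrapy and named after the spider
            self.logger.info({'title': title.css('a ::text').get()})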

The only thing that was needed was to pass the log settings to CrawlerRunner as well:

import logging

import crochet
import scrapy
from scrapy import crawler
from scrapy.utils import log

crochet.setup()

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            logging.info({'title': title.css('a ::text').get()})

@crochet.wait_for(timeout=None)
def crawl():
    # pass the same log settings to the runner so Scrapy applies them too
    runner = crawler.CrawlerRunner(settings=log_settings)
    deferred = runner.crawl(BlogSpider)
    return deferred

log_settings = {'LOG_FILE': 'my.log'}
log.configure_logging(settings=log_settings)
logging.info("Starting...")
crawl()
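
To sanity-check the fix, you can read the log file back once crawl() returns (assuming the my.log file name from the snippet above):

# crawl() blocks (via crochet), so the file is complete at this point
with open('my.log') as f:
    print(f.read())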