Get a log message from scrapy spider and assign it to a variable

I want to inspect the log messages coming from this logger: [scrapy.spidermiddlewares.httperror], and based on them a function will perform a specific action. So basically I want to assign the message to a variable as a string, and then search that string for keywords.

In the documentation I couldn't find a way to do this; everything there is about formatting the log output.

import scrapy

class spider1(scrapy.Spider):
    name = 'spider1'
    allowed_domains = []
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 2}
    start_urls = ['https://quotes.toscrape.com/']


    def parse(self, response):
        print(response.text)

Example log line:

2022-02-03 03:11:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <402 https://quotes.toscrape.com/>: HTTP status code is not handled or not allowed

I want to assign the log message above to a variable.

I know I could write the whole log output to a .txt file, but since I will have multiple spiders running in an endless loop, that would produce a huge amount of data to iterate over.

You can use a log filter and apply it to the specific scrapy.spidermiddlewares.httperror logger. Then you can use a regular expression to capture the exact type of error you want to filter on and write it to a file. See the example code below:

import scrapy
import logging
import re

class ContentFilter(logging.Filter):
    def filter(self, record):
        # Use record.getMessage(): it returns the fully formatted message,
        # whereas record.msg may still contain unexpanded %-placeholders.
        message = record.getMessage()
        match = re.search(r'Ignoring response <.*>: HTTP status code is not handled or not allowed', message)
        if match:
            with open("logged_messages.log", "a") as f:
                f.write(message + '\n')
        # Always return True so the logger's other messages are not suppressed
        return True

class spider1(scrapy.Spider):
    name = 'spider1'
    allowed_domains = []
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 2}
    start_urls = ['https://quotes.toscrape.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.addFilter(ContentFilter())

    def parse(self, response):
        yield {
            "title": response.css("title::text").get()
        }

You can read more about the logging module and the customizations you can make in the scrapy docs.
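Since the question asked for the message as a variable rather than a file, here is a minimal alternative sketch using a custom logging.Handler that keeps matching messages in an in-memory list. MessageCollector is a hypothetical helper name, not part of Scrapy; it only relies on the standard logging API:

```python
import logging
import re

class MessageCollector(logging.Handler):
    """Hypothetical handler that stores matching log messages in a list."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.messages = []  # collected log lines, usable as a plain variable

    def emit(self, record):
        # getMessage() returns the fully formatted message string
        message = record.getMessage()
        if self.pattern.search(message):
            self.messages.append(message)

# Attach it to the httperror logger, e.g. in the spider's __init__:
collector = MessageCollector(r"HTTP status code is not handled or not allowed")
logging.getLogger("scrapy.spidermiddlewares.httperror").addHandler(collector)
```

Afterwards you can inspect collector.messages (for example in the spider's closed() callback) and search each string for keywords, without ever touching the filesystem.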