Scrapy appears to be deduplicating the first request when it is processed with DownloaderMiddleware

Question

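I have a spider that inherits from SitemapSpider. As expected, the first request at startup is for my website's sitemap.xml. However, for the spider to work correctly, I need to add a header to every request, including the initial one that fetches the sitemap. I do this with a DownloaderMiddleware, as follows: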

def process_request(self, request: scrapy.http.Request, spider):
    if "Host" in request.headers:
        return None

    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    request.headers["Host"] = host
    spider.logger.info(f"Got {request}")
    return request

However, it looks like Scrapy's duplicate request filter is preventing that request from going through. In my logs I see something like this:

2021-10-16 21:21:08 [ficbook-spider] INFO: Got <GET https://mywebsite.com/sitemap.xml>
2021-10-16 21:21:08 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mywebsite.com/sitemap.xml>

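Since the spider.logger.info call in process_request fired only once, I assume the logged request is the first one and that it is deduplicated after being processed. I considered that deduplication might be triggered before the DownloaderMiddleware (which would explain the request being dropped without a second "Got ..." line in the log), but I don't think that is the case, for two reasons: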

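I looked at the SitemapSpider code, and it appears to fetch sitemap.xml only once.
If it had in fact fetched the sitemap earlier, I would expect it to do something with the result, rather than just stopping the spider because no pages are queued for processing.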

Why does this happen? Did I make a mistake in process_request?

Answer 1

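It does nothing with the first response and never fetches a second one because your custom DownloaderMiddleware's process_request returns a Request. Scrapy reschedules any request returned this way, and since the duplicate filter has already seen this request, the rescheduled one is filtered out. From the docs: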

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
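To make the filtering step concrete, here is a toy model of a scheduler with a duplicate filter. It is a simplified sketch, not Scrapy's actual implementation: `toy_fingerprint` and `ToyScheduler` are hypothetical names, and the sketch assumes the default behaviour in which request headers do not contribute to the fingerprint, so adding a Host header does not make the rescheduled request look new.

```python
import hashlib


def toy_fingerprint(method, url):
    # Simplified stand-in for a request fingerprint.
    # Assumption: headers are deliberately left out of the hash.
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()


class ToyScheduler:
    """Hypothetical scheduler that drops requests it has already seen."""

    def __init__(self):
        self.seen = set()

    def enqueue(self, method, url, headers=None, dont_filter=False):
        fp = toy_fingerprint(method, url)
        if not dont_filter and fp in self.seen:
            return False  # filtered as a duplicate
        self.seen.add(fp)
        return True  # accepted for download


scheduler = ToyScheduler()
url = "https://mywebsite.com/sitemap.xml"

# Initial request scheduled by the spider.
first = scheduler.enqueue("GET", url)
# Same request returned from process_request with an extra header.
rescheduled = scheduler.enqueue("GET", url, headers={"Host": "mywebsite.com"})
# Same request again, but explicitly exempt from filtering.
forced = scheduler.enqueue(
    "GET", url, headers={"Host": "mywebsite.com"}, dont_filter=True
)

print(first, rescheduled, forced)  # True False True
```

In this model the second call mirrors the log in the question: the request was seen once by the middleware, then rejected when it came back to the scheduler.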

If you explicitly mark the returned request as not to be filtered, it should work:

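import scrapy

def process_request(self, request: scrapy.http.Request, spider):
    # The header is already set (e.g. on the rescheduled copy): continue normally.
    if "Host" in request.headers:
        return None

    # Note: str.removeprefix requires Python 3.9+.
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    # Copy the request with dont_filter=True so the rescheduled copy
    # is not dropped by the duplicate filter.
    new_req = request.replace(dont_filter=True)
    new_req.headers["Host"] = host
    spider.logger.info(f"Got {new_req}")
    return new_req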
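An alternative worth considering is to avoid rescheduling altogether: modify the headers in place and return None, which tells Scrapy to continue processing the same request. The following is a minimal sketch under that assumption; the class name `HostHeaderMiddleware` is hypothetical, and `urlsplit` is swapped in for the `removeprefix` chain to extract the host.

```python
from urllib.parse import urlsplit


class HostHeaderMiddleware:
    """Hypothetical downloader middleware that sets the Host header in place."""

    def process_request(self, request, spider):
        # Leave requests that already carry a Host header untouched.
        if "Host" in request.headers:
            return None

        # netloc is the "host[:port]" part of the URL.
        request.headers["Host"] = urlsplit(request.url).netloc

        # Returning None lets Scrapy carry on with this same request,
        # so it is not rescheduled and never reaches the duplicate filter again.
        return None
```

With this variant there should be no need for dont_filter=True, because no second request is ever created.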


python

scrapy

scrapy-middleware