Scrapy rules not working when process_request and callback parameter are set

Question

I have this Scrapy rule in a CrawlSpider:

rules = [
    Rule(
        LinkExtractor(
            # raw string so that \d is passed to the regex engine unchanged
            allow=r'/topic/\d+/organize$',
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]',
        ),
        process_request='request_tagPage',
        callback='parse_tagPage',
        follow=True,
    )
]

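request_tagPage() is a function that attaches a cookie to the request, and parse_tagPage() is a function that parses the target page. According to the documentation, CrawlSpider should use request_tagPage to make the request and, once the response comes back, call parse_tagPage() to parse it. However, I noticed that when request_tagPage() is used, the spider never calls parse_tagPage() at all. So in the actual code I manually set parse_tagPage() as the callback inside request_tagPage, like this: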

def request_tagPage(self, request):
    # Attach the cookie jar to the request, otherwise I can't log in.
    return Request(request.url, meta={"cookiejar": 1},
                   headers=self.headers,
                   callback=self.parse_tagPage)  # manually add a callback function

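That worked, but now the spider no longer uses the rules to expand its crawl. It closes after crawling the links from start_urls. Before I manually set parse_tagPage() as the callback in request_tagPage(), however, the rules did work. So I'm thinking this may be a bug? Is there a way to use all three together: request_tagPage(), which I need to attach the cookie to the request; parse_tagPage(), which parses the page; and the rules, which direct the spider's crawl?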

Answer 1

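I found the problem. CrawlSpider uses its default parse() to apply the rules. So when my custom parse_tagPage() is called, there is no parse() left to keep applying the rules. The solution is simply to call the default parse() from my custom parse_tagPage(). Now it looks like this: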

def parse_tagPage(self, response):
    # parse the response, get the information I want...
    # save the information into a local file...
    return self.parse(response) # simply calls the default parse() function to enable the rules

Answer 2

Requests generated by CrawlSpider rules use internal callbacks and use meta to do their "magic".

I suggest that you don't recreate your requests from scratch in the rule's process_request hook (or you'll probably end up reimplementing what CrawlSpider already does for you).

Instead, if you only want to add cookies and special headers, you can use the .replace() method on the request that is passed to request_tagPage, so that the "magic" of CrawlSpider is preserved.

Something like this should be enough:

def request_tagPage(self, request):
    tagged = request.replace(headers=self.headers)
    tagged.meta.update(cookiejar=1)
    return tagged
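
To make the difference concrete, here is a toy model that runs without Scrapy installed. It is only a sketch: ToyRequest, rule_request, recreate, patch and the callback name "_internal_rule_callback" are hypothetical stand-ins, not Scrapy's real classes or internals. The assumption it encodes is the one stated above, namely that a rule-generated request carries an internal callback plus rule bookkeeping in meta, and that rebuilding the request from its URL alone throws both away.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for a crawler request; NOT Scrapy's Request."""
    url: str
    callback: str = "default_parse"            # name of the response handler
    meta: dict = field(default_factory=dict)   # per-request bookkeeping
    headers: dict = field(default_factory=dict)


def rule_request(url, rule_index):
    """Mimic a rule-generated request: internal callback + rule info in meta."""
    return ToyRequest(url, callback="_internal_rule_callback",
                      meta={"rule": rule_index})


def recreate(request, headers):
    """Rebuild from scratch (the question's approach): only the URL survives."""
    return ToyRequest(request.url, callback="parse_tagPage",
                      meta={"cookiejar": 1}, headers=headers)


def patch(request, headers):
    """Copy-and-modify (this answer's approach): callback and rule meta are kept."""
    return replace(request, headers=headers,
                   meta={**request.meta, "cookiejar": 1})


original = rule_request("https://example.com/topic/1/organize", rule_index=0)
rebuilt = recreate(original, {"User-Agent": "toy"})
patched = patch(original, {"User-Agent": "toy"})

print("rebuilt:", rebuilt.callback, rebuilt.meta)
print("patched:", patched.callback, patched.meta)
```

In the rebuilt request the rule entry in meta and the internal callback are gone, so nothing is left to tie the response back to a rule. The patched request keeps both and only gains the cookie jar and headers.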
