Scrapy merging chained requests into one

I have a scenario where I'm crawling through a shop, browsing 10+ pages. Whenever I find an item I want, I add it to the basket.

Finally, I want to check out. The problem is that with the Scrapy callback chain below, it checks out the basket as many times as there are items in it.

How can I merge the chained requests into one, so that checkout is called only once after the 10 items have been added to the basket?

def start_requests(self):
    params = getShopList()
    for param in params:
        # carry param in meta so each follow-up callback can reuse it
        yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                 method='POST', formdata=param,
                                 meta={'param': param})


def addToBasket(self, response):
    param = response.meta['param']
    yield scrapy.FormRequest('https://foo.bar/addToBasket', callback=self.checkoutBasket,
                             method='POST', formdata=param,
                             meta={'param': param})

def checkoutBasket(self, response):
    # this fires once per item, which is exactly the problem
    param = response.meta['param']
    yield scrapy.FormRequest('https://foo.bar/checkout', callback=self.final, method='POST',
                             formdata=param)

def final(self, response):
    print("Success, you have purchased 59 items")

Edit:

I tried issuing a request from the closed event, but the request never runs and the callback never fires. As far as I can tell the engine has already stopped by then, and the return value of closed isn't scheduled as a request.

def closed(self, reason):
    if reason == "finished":
        print("spider finished")
        return scrapy.Request('https://www.google.com', callback=self.finalmethod)
    print("Spider closed but not finished.")

def finalmethod(self, response):
    print("finalized")

I think you could just do the checkout manually after the spider finishes:

def closed(self, reason):
    if reason == "finished":
        # checkout_url and param are placeholders for your real endpoint/payload;
        # requests here is the third-party HTTP library, not Scrapy
        return requests.post(checkout_url, data=param)
    print("Spider closed but not finished.")

See the documentation for closed.
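One caveat worth flagging: requests.post() starts a fresh HTTP session, so if the shop tracks the basket through session cookies (an assumption about the site, but a common setup), the checkout would hit an empty basket. Below is a minimal sketch of carrying the cookie over; the class name ShopSpider, the attribute cookie_header, and the URL are illustrative, not from the original post:

import requests
import scrapy

class ShopSpider(scrapy.Spider):
    name = 'shop'
    checkout_url = 'https://foo.bar/checkout'  # placeholder endpoint
    cookie_header = ''

    def addToBasket(self, response):
        # Scrapy's cookie middleware put the session cookie on the request;
        # stash the raw header so the out-of-band checkout reuses the same basket
        self.cookie_header = response.request.headers.get('Cookie', b'').decode()

    def closed(self, reason):
        if reason == "finished":
            return requests.post(self.checkout_url,
                                 headers={'Cookie': self.cookie_header})
        print("Spider closed but not finished.")

Keeping everything inside Scrapy, as in the signal-based solution below, avoids this cookie hand-off entirely.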

Update

import scrapy
import requests

class MySpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        params = getShopList()
        for param in params:
            yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                     method='POST', formdata=param,
                                     meta={'param': param})

    def addToBasket(self, response):
        param = response.meta['param']
        yield scrapy.FormRequest('https://foo.bar/addToBasket',
                                 method='POST', formdata=param)

    def closed(self, reason):
        if reason == "finished":
            # checkout_url and param are placeholders for the real endpoint/payload
            return requests.post(checkout_url, data=param)
        print("Spider closed but not finished.")

I solved it using Scrapy signals, specifically spider_idle. From the docs:

Sent when a spider has gone idle, which means the spider has no further:

  • requests waiting to be downloaded
  • requests scheduled
  • items being processed in the item pipeline

https://doc.scrapy.org/en/latest/topics/signals.html

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        self.crawler.signals.connect(self.spider_idle, signals.spider_idle) ## notice this
        params = getShopList()
        for param in params:
            yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                     method='POST', formdata=param,
                                     meta={'param': param})


    def addToBasket(self, response):
        param = response.meta['param']  # carried over from start_requests
        yield scrapy.FormRequest('https://foo.bar/addToBasket',
                                 method='POST', formdata=param)

    def spider_idle(self, spider): ## when all requests are finished, this is called
        req = scrapy.Request('https://foo.bar/checkout', callback=self.checkoutFinished)
        # the scheduler's duplicate filter keeps this request from being
        # re-queued on the next idle cycle, so the spider can still close
        self.crawler.engine.crawl(req, spider)

    def checkoutFinished(self, response):
        print("Checkout finished")