Scrapy merging chained requests into one
I have a scenario where I browse a shop across 10+ pages. When I find an item I want, I add it to the basket.
Finally I check out. The problem is that with chained Scrapy requests, the checkout gets called as many times as there are items in my basket.
How can I merge the chained requests into one, so that checkout is only called once, after all 10 items have been added to the basket?
def start_requests(self):
    params = getShopList()
    for param in params:
        yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                 method='POST', formdata=param,
                                 cb_kwargs={'param': param})  # pass param along the chain

def addToBasket(self, response, param):
    yield scrapy.FormRequest('https://foo.bar/addToBasket', callback=self.checkoutBasket,
                             method='POST', formdata=param,
                             cb_kwargs={'param': param})

def checkoutBasket(self, response, param):
    yield scrapy.FormRequest('https://foo.bar/checkout', callback=self.final,
                             method='POST', formdata=param)

def final(self, response):
    print("Success, you have purchased 59 items")
Edit:
I tried issuing the request from the closed event, but the request is never actually made and the callback never fires..
def closed(self, reason):
if reason == "finished":
print("spider finished")
return scrapy.Request('https://www.google.com', callback=self.finalmethod)
print("Spider closed but not finished.")
def finalmethod(self, response):
print("finalized")
I think you could do the checkout manually after the spider finishes:
def closed(self, reason):
    if reason == "finished":
        # plain HTTP POST via the requests library, outside the Scrapy engine
        return requests.post(checkout_url, data=param)
    print("Spider closed but not finished.")
See closed.
Update
class MySpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        params = getShopList()
        for param in params:
            yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                     method='POST', formdata=param,
                                     cb_kwargs={'param': param})

    def addToBasket(self, response, param):
        yield scrapy.FormRequest('https://foo.bar/addToBasket',
                                 method='POST', formdata=param)

    def closed(self, reason):
        if reason == "finished":
            return requests.post(checkout_url, data=param)
        print("Spider closed but not finished.")
I solved it using Scrapy signals and the spider_idle signal.
From the docs:
Sent when a spider has gone idle, which means the spider has no further:
- requests waiting to be downloaded
- requests scheduled
- items being processed in the item pipeline
https://doc.scrapy.org/en/latest/topics/signals.html
import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        self.crawler.signals.connect(self.spider_idle, signals.spider_idle)  ## notice this
        params = getShopList()
        for param in params:
            yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                     method='POST', formdata=param,
                                     cb_kwargs={'param': param})

    def addToBasket(self, response, param):
        yield scrapy.FormRequest('https://foo.bar/addToBasket',
                                 method='POST', formdata=param)

    def spider_idle(self, spider):  ## when all requests are finished, this is called
        req = scrapy.Request('https://foo.bar/checkout', callback=self.checkoutFinished)
        self.crawler.engine.crawl(req, spider)

    def checkoutFinished(self, response):
        print("Checkout finished")
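The idle-signal pattern is easier to see in isolation. The sketch below uses a hypothetical MiniEngine (not part of Scrapy) as a stand-in for the crawler: it drains a job queue and fires "idle" handlers when the queue empties, which is exactly the moment spider_idle gives you to schedule the one final checkout request.

```python
from collections import deque

class MiniEngine:
    """Hypothetical stand-in for the Scrapy engine, for illustration only."""

    def __init__(self):
        self.queue = deque()       # pending jobs, like scheduled Requests
        self.idle_handlers = []    # like signals.spider_idle subscribers

    def connect(self, handler):    # like crawler.signals.connect(...)
        self.idle_handlers.append(handler)

    def schedule(self, job):       # like yielding / crawl()-ing a Request
        self.queue.append(job)

    def run(self):
        results = []
        while self.queue:
            results.append(self.queue.popleft()())
            if not self.queue:
                # queue drained: fire idle handlers, which may add more work
                for h in list(self.idle_handlers):
                    h(self)
        return results

engine = MiniEngine()
for i in range(3):                 # three "addToBasket" jobs
    engine.schedule(lambda i=i: f"addToBasket {i}")

def on_idle(eng):
    # disconnect first so checkout is scheduled exactly once
    eng.idle_handlers.remove(on_idle)
    eng.schedule(lambda: "checkout")

engine.connect(on_idle)
results = engine.run()
print(results)  # ['addToBasket 0', 'addToBasket 1', 'addToBasket 2', 'checkout']
```

The key point carried over to the real spider: the checkout is not part of the per-item chain at all; it is scheduled once from the idle hook, after every basket request has completed.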