Running multiple spiders in the same process, one spider at a time
I have a CrawlSpider that searches for results by postal code and category (sent as POST data). I need to get all results for every category in every postal code. My spider takes a postal code and a category as parameters for the POST data, and I want to programmatically launch a spider for each postal code/category combination from a script.
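For reference, the spider is structured roughly like this (a simplified sketch; the URL, form field names and parsing below are placeholders, not the real ones):

import scrapy
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest


class MySpider(CrawlSpider):
    name = 'my_spider'

    def __init__(self, postal_code=None, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.postal_code = postal_code
        self.category = category

    def start_requests(self):
        # POST the search form with the given postal code and category
        return [FormRequest(
            'http://example.com/search',  # placeholder URL
            formdata={'postal_code': self.postal_code,
                      'category': self.category},
            callback=self.parse_results,
        )]

    def parse_results(self, response):
        # extract items from the search results page (placeholder)
        pass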
The documentation explains that you can run multiple spiders per process using the code example here: http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process That is along the same lines as what I want to do, but I essentially want to queue up the spiders so that each one runs only after the previous one has finished.
Any ideas on how to accomplish this? There seem to be some answers that work with older versions of Scrapy (~0.13), but the architecture has changed and they no longer work with the latest stable release (0.24.4).
You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is the sample code (not tested) based on this answer and adapted for your use case:
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']

def configure_crawler(postal_code):
    # MySpider is the CrawlSpider from the question (defined/imported elsewhere)
    spider = MySpider(postal_code)

    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach spider
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)

# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()

settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
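Note that the sketch above only cycles through postal codes. Since you actually need every postal code/category combination, one option (untested, and assuming MySpider also accepts a category argument) is to build the queue with itertools.product and pop a (postal_code, category) pair each time; only the queue, configure_crawler and callback change, the rest of the wiring stays the same:

import itertools

# every postal code/category combination, crawled one after another
categories = ['category_a', 'category_b']  # placeholder category values
jobs = list(itertools.product(postal_codes, categories))

def configure_crawler(job):
    postal_code, category = job
    spider = MySpider(postal_code, category)
    # ... connect spider_closed, configure and crawl exactly as above ...

def callback(spider, reason):
    try:
        configure_crawler(jobs.pop())
    except IndexError:
        # no combinations left - stop the reactor
        reactor.stop()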