Running multiple spiders in the same process, one spider at a time

I have a CrawlSpider that searches for results using a postal code and a category (sent as POST data). I need to get all results for every category across all postal codes. My spider takes the postal code and category as parameters of the POST data, and I want to launch the spider programmatically from a script for each postal code/category combination.
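
For context, such a spider might look roughly like the sketch below; the search URL and form field names ('http://example.com/search', 'zip', 'category') are placeholders rather than details from the question, and the crawl rules are omitted:

from scrapy.http import FormRequest
from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my_spider'

    def __init__(self, postal_code=None, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.postal_code = postal_code
        self.category = category

    def start_requests(self):
        # POST the postal code and category to the (placeholder) search endpoint
        yield FormRequest(
            'http://example.com/search',
            formdata={'zip': self.postal_code, 'category': self.category},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # extract items from the search results here
        pass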

The documentation says you can run multiple spiders per process using the code example here: http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. That is close to what I want to do, but I essentially want to queue the spiders so that each one starts only after the previous one has finished.
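
For reference, the approach in that documentation (for the 0.24-era Crawler API) starts all the crawlers up front, so the spiders run concurrently rather than one after another — a rough sketch, with placeholder category values:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from scrapy.settings import Settings


def setup_crawler(postal_code, category):
    # each combination gets its own Crawler; they all share the same reactor
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(MySpider(postal_code, category))
    crawler.start()


for postal_code in ['10801', '10802', '10803']:
    for category in ['restaurants', 'plumbers']:  # placeholder categories
        setup_crawler(postal_code, category)

log.start()
reactor.run()  # note: you still have to stop the reactor yourself when all spiders are done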

Any ideas on how to accomplish this? There seem to be some answers that work for older versions of Scrapy (~0.13), but the architecture has changed since then and they no longer work with the latest stable release (0.24.4).

You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is sample code (not tested) based on this answer, adapted to your use case:

from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor
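
# MySpider is your existing CrawlSpider; it is assumed to be importable
# from your project, e.g.: from myproject.spiders import MySpider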

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']


def configure_crawler(postal_code):
    spider = MySpider(postal_code)

    # fire callback() whenever a spider run finishes
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach the previous spider so the crawler accepts a new one
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)


# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()


settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
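
Since every postal code/category combination is needed, the same queueing trick works with (postal_code, category) pairs instead of bare postal codes — a sketch of that adjustment, assuming MySpider accepts both arguments:

import itertools

# placeholder category values; the work queue now holds (postal_code, category) pairs
categories = ['restaurants', 'plumbers']
jobs = list(itertools.product(postal_codes, categories))


def configure_crawler(job):
    postal_code, category = job
    spider = MySpider(postal_code, category)

    crawler.signals.connect(callback, signal=signals.spider_closed)
    crawler._spider = None
    crawler.configure()
    crawler.crawl(spider)


def callback(spider, reason):
    try:
        configure_crawler(jobs.pop())
    except IndexError:
        # no combinations left, shut down
        reactor.stop()

The rest of the script (creating the Crawler, the initial configure_crawler(jobs.pop()) call, crawler.start(), log.start() and reactor.run()) stays the same.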