How to make scrapy spider crawl again if condition is not met?

Question

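In my close function, I check whether a document scraped today exists. If no such document is found, I want to tell my spider to crawl again. Basically, I need a reliable way to have the scraper keep calling its crawl routine until a specific condition is met or MAX_RETRIES has been exhausted.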

Answer 1

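To run a spider again after it has finished, you need to use the reactor and the CrawlerRunner class. The crawl method returns a Deferred that fires once the spider has finished crawling, and you can use it to add a callback in which you do your checks. See the example below, where the spider is re-run until the number of retries reaches 3 and then stops.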

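You need to be careful about how you do the checks: this is asynchronous code, so the order in which things execute may differ from what you expect.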

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            "url": response.url
        }

if __name__ == '__main__':
    RETRIES = 0
    configure_logging()
    runner = CrawlerRunner()

    def finished():
        global RETRIES
        # Do your checks in this callback and run the spider again if needed.
        # In this example we only check whether the number of retries is below the limit;
        # once the limit is reached, we stop the reactor.
        if RETRIES < 3:
            RETRIES += 1
            d = runner.crawl(ExampleSpider)
            d.addBoth(lambda _: finished())
        else:
            reactor.stop()  # no retries left: stop the reactor so the script exits

    d = runner.crawl(ExampleSpider)
    d.addBoth(lambda _: finished())
    reactor.run()
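
To get closer to what the question asks for (stop as soon as a condition holds, or when MAX_RETRIES runs out), the retry logic can be pulled into a small helper. This is only a sketch: `crawl_until`, `condition_met`, and `on_done` are names made up for illustration, not part of Scrapy. The only assumption is that `runner.crawl(...)` returns a Deferred-like object with `addBoth`, as CrawlerRunner does.

```python
def crawl_until(runner, spider_cls, condition_met, max_retries, on_done):
    """Run spider_cls, then re-run it until condition_met() returns True
    or max_retries extra runs have been used; finally call on_done(retries).

    runner is assumed to behave like scrapy.crawler.CrawlerRunner, i.e.
    runner.crawl(spider_cls) returns an object with an addBoth() method.
    """
    state = {"retries": 0}  # mutable counter shared with the nested callback

    def check(_result):
        # Called after every crawl, whether it succeeded or failed.
        if condition_met() or state["retries"] >= max_retries:
            on_done(state["retries"])
            return
        state["retries"] += 1
        runner.crawl(spider_cls).addBoth(check)

    # First run; each later run is scheduled from inside check().
    runner.crawl(spider_cls).addBoth(check)
```

With the example above, this could be wired up as `crawl_until(runner, ExampleSpider, has_document_for_today, MAX_RETRIES, lambda _n: reactor.stop())` followed by `reactor.run()`. Here `has_document_for_today` is a hypothetical function you would write yourself to look up today's document in wherever your pipeline stores it. Because the check runs in the Deferred callback, it only sees the state after the crawl has fully finished.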

python

scrapy