Scrapy: Run spiders sequentially with different settings for each spider

For several days now I have been struggling with a Scrapy/Twisted problem in my Main.py, which is supposed to run different spiders and analyse their output. Unfortunately, MySpider2 depends on the FEED produced by MySpider1 and can therefore only run after MySpider1 has finished. In addition, MySpider1 and MySpider2 need different settings. So far I have not found a solution that lets me run the spiders sequentially, each with its own settings. I have looked at the Scrapy CrawlerRunner and CrawlerProcess docs and experimented with several related Stack Overflow questions (Run Multiple Spider sequentially, Scrapy: how to run two crawlers one after another?, Scrapy run multiple spiders from a script, among others), but without success.

Following the documentation example for running spiders sequentially, my (slightly modified) code is:

from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # MORE settings are here
    }, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # MORE settings are here
    }]

spiders = [MySpider1, MySpider2]

process = CrawlerRunner(spider_settings[0])
process = CrawlerRunner(spider_settings[1]) # Not sure if this is how it's supposed to be used for
# multiple settings, but moving this line to just before "yield process.crawl(spiders[1])" also results in an error.

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()
crawl()
reactor.run()

With this code, however, only the first spider is executed, and without any of its settings. So I tried CrawlerProcess instead, which worked better:

from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # MORE settings are here
    }, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # MORE settings are here
    }]

spiders = [MySpider1, MySpider2]

process = CrawlerProcess(spider_settings[0])
process = CrawlerProcess(spider_settings[1])

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()
crawl()
reactor.run()

This code does run both spiders, but at the same time rather than in the intended order. On top of that, after about a second it overrides spider[0]'s settings with spider[1]'s settings, so the first log file is cut off after just two lines and logging for both spiders resumes in 123/log.log.

In a perfect world, my snippet would work as follows:

  1. Run spider[0] with spider_settings[0]
  2. Wait until spider[0] has finished.
  3. Run spider[1] with spider_settings[1]

Thanks in advance for your help.

Keep the runners separate and it should work:

process_1 = CrawlerRunner(spider_settings[0])
process_2 = CrawlerRunner(spider_settings[1])

#...

@defer.inlineCallbacks
def crawl():
    yield process_1.crawl(spiders[0])
    yield process_2.crawl(spiders[1])
    reactor.stop()

#...
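
For completeness, here is a minimal end-to-end sketch of that approach, assembled from the snippets above. It assumes the same imports, spider classes and example settings as in the question; the paths and settings values are illustrative, not verified:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2

spider_settings = [
    {'FEED_URI': 'abc.csv', 'LOG_FILE': 'abc/log.log'},   # settings for MySpider1, as in the question
    {'FEED_URI': '123.csv', 'LOG_FILE': '123/log.log'},   # settings for MySpider2, as in the question
]

# Unlike CrawlerProcess, CrawlerRunner does not configure logging by itself,
# so set it up once here; routing each spider's output to its own LOG_FILE
# may still need extra handler work.
configure_logging()

# One runner per spider, each holding its own settings object.
runner_1 = CrawlerRunner(spider_settings[0])
runner_2 = CrawlerRunner(spider_settings[1])

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next one.
    yield runner_1.crawl(MySpider1)
    yield runner_2.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # blocks here until crawl() stops the reactor

The key point is that each CrawlerRunner keeps its own settings, while the yields inside the inlineCallbacks generator enforce the order: MySpider2 only starts after the deferred returned by runner_1.crawl(MySpider1) has fired.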