How to pass data between sequential spiders

I have two spiders that run sequentially, following https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. Now I want to pass some information from the first spider to the second (a Selenium webdriver, or its session information).

I'm quite new to Scrapy, but in another post it was suggested to save the data to a database and retrieve it from there. That seems like overkill for passing a single variable; is there no other way? (I know in this example I could merge everything into one long spider, but later I want to run the first spider once and the second spider multiple times.)

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    # Open a webdriver and get session_id
    ...

class Spider2(scrapy.Spider):
    # Get the session_id and run spider2 code
    def __init__(self, session_id):
        ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # TODO How to get the session_id?
    # session_id = yield runner.crawl(Spider1) returns None
    # Adding a return statement in Spider1 instead actually breaks the
    # sequential processing, and the program sleeps before Spider1 has run

    time.sleep(2)

    yield runner.crawl(Spider2(session_id))
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

I would like to pass the variable to the constructor of the second spider, but I can't get the data out of the first one. If I just run the first crawler so that it returns the variable, that obviously breaks the sequential structure. If I try to retrieve the yielded result, it is None.

Am I completely blind here? I can't believe this turns out to be such a complicated task.

You can pass a queue to both spiders and let Spider2 block on queue.get(), so the time.sleep(2) is no longer needed.

# globals.py
from queue import Queue

queue = Queue()


# run.py
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

import globals


class Spider1(scrapy.Spider):
    def __init__(self):
        # put session_id to `globals.queue` somewhere in `Spider1`, so `Spider2` can start.
        ...

class Spider2(scrapy.Spider):
    def __init__(self):
        session_id = globals.queue.get()

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run() 
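
For completeness, here is one way Spider1 might fill the queue. This is a minimal sketch only, assuming Spider1 drives a Selenium Chrome instance; the spider name, start URL and parse logic are my own placeholders, not part of the original answer:

# spider1.py (sketch)
import scrapy
from selenium import webdriver

import globals


class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Open the webdriver and hand its session_id to Spider2 via the queue.
        driver = webdriver.Chrome()
        driver.get(response.url)
        globals.queue.put(driver.session_id)  # this is what unblocks Spider2's queue.get()
        yield {"session_id": driver.session_id}

Some side channel like this is needed because runner.crawl() returns a Deferred that simply fires (with None) when the crawl has finished, which is why session_id = yield runner.crawl(Spider1) in the question gives None.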

You can also just create the webdriver and pass it in as an argument. When I originally tried this it didn't work, because I was passing the argument incorrectly (see my comment on the post).

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from selenium import webdriver
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # Do whatever with the driver

class Spider2(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # This is the same driver as Spider1 used


configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    driver = webdriver.Chrome()

    yield runner.crawl(Spider1, driver=driver)
    yield runner.crawl(Spider2, driver=driver)

    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
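
The mistake mentioned above is worth spelling out: runner.crawl() takes the spider class (not an instance you construct yourself), and any extra keyword arguments are forwarded to the spider's __init__, which is why runner.crawl(Spider2, driver=driver) works while runner.crawl(Spider2(session_id)) from the question does not. Since the question also mentions running the second spider several times, here is a rough sketch of how the crawl() function could do that while reusing the same driver; the loop count and the driver.quit() call are my own additions, not from the original post:

@defer.inlineCallbacks
def crawl():
    driver = webdriver.Chrome()

    yield runner.crawl(Spider1, driver=driver)

    # The second spider can be scheduled any number of times with the same driver.
    for _ in range(3):
        yield runner.crawl(Spider2, driver=driver)

    driver.quit()  # clean up the shared browser once all crawls have finished
    reactor.stop()

crawl()
reactor.run()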