How to pass data between sequential spiders

I have two spiders that run sequentially, following https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. Now I want to pass some information from the first spider to the second (a Selenium webdriver, or its session information).

I'm quite new to Scrapy, but in another post it was suggested to save the data to a database and retrieve it from there. That seems like overkill for passing a single variable; is there no other way? (I know in this example I could merge everything into one long spider, but later I want to run the first spider once and the second spider multiple times.)

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    # Open a webdriver and get session_id
    ...

class Spider2(scrapy.Spider):
    # Get the session_id and run spider2 code
    def __init__(self, session_id):
        ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # TODO How to get the session_id?
    # session_id = yield runner.crawl(Spider1) returns None
    # Adding a return statement in Spider1 instead actually breaks the
    # sequential processing, and the program sleeps before Spider1 has run

    time.sleep(2)

    yield runner.crawl(Spider2(session_id))
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

I would like to pass the variable to the constructor of the second spider, but I can't get the data out of the first one. If I just run the first crawler so that it returns the variable, that obviously breaks the sequential structure. If I try to retrieve the yielded result, it is None.

Am I completely blind here? I can't believe this turns out to be such a complicated task.

You can pass a queue to both spiders and let Spider2 block on queue.get(), so the time.sleep(2) is no longer needed.

# globals.py
from queue import Queue

queue = Queue()


# run.py
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

import globals


class Spider1(scrapy.Spider):
    def __init__(self):
        # put session_id to `globals.queue` somewhere in `Spider1`, so `Spider2` can start.
        ...

class Spider2(scrapy.Spider):
    def __init__(self):
        session_id = globals.queue.get()

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run() 
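
For completeness, here is one way Spider1 might fill the queue. This is a minimal sketch only, assuming Spider1 drives a Selenium Chrome instance; the spider name, start URL and parse logic are my own placeholders, not part of the original answer:

# spider1.py (sketch)
import scrapy
from selenium import webdriver

import globals


class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Open the webdriver and hand its session_id to Spider2 via the queue.
        driver = webdriver.Chrome()
        driver.get(response.url)
        globals.queue.put(driver.session_id)  # this is what unblocks Spider2's queue.get()
        yield {"session_id": driver.session_id}

Some side channel like this is needed because runner.crawl() returns a Deferred that simply fires (with None) when the crawl has finished, which is why session_id = yield runner.crawl(Spider1) in the question gives None.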

You can also just create the webdriver and pass it in as an argument. When I originally tried this it didn't work, because I was passing the argument incorrectly (see my comment on the post).

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from selenium import webdriver
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # Do whatever with the driver

class Spider2(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # This is the same driver as Spider1 used


configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    driver = webdriver.Chrome()

    yield runner.crawl(Spider1, driver=driver)
    yield runner.crawl(Spider2, driver=driver)

    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
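
The mistake mentioned above is worth spelling out: runner.crawl() takes the spider class (not an instance you construct yourself), and any extra keyword arguments are forwarded to the spider's __init__, which is why runner.crawl(Spider2, driver=driver) works while runner.crawl(Spider2(session_id)) from the question does not. Since the question also mentions running the second spider several times, here is a rough sketch of how the crawl() function could do that while reusing the same driver; the loop count and the driver.quit() call are my own additions, not from the original post:

@defer.inlineCallbacks
def crawl():
    driver = webdriver.Chrome()

    yield runner.crawl(Spider1, driver=driver)

    # The second spider can be scheduled any number of times with the same driver.
    for _ in range(3):
        yield runner.crawl(Spider2, driver=driver)

    driver.quit()  # clean up the shared browser once all crawls have finished
    reactor.stop()

crawl()
reactor.run()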