How to pass data between sequential spiders
I have two spiders that run sequentially following https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. Now I want to pass some information from the first spider to the second one (a Selenium webdriver, or its session information).
I'm fairly new to Scrapy, but on another post it was suggested to save the data to a database and retrieve it from there. That seems like overkill for passing a single variable; is there no other way?
(I know that in this example I could make it one long spider, but later I want to run the first spider once but the second spider multiple times.)
import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    # Open a webdriver and get session_id
    ...

class Spider2(scrapy.Spider):
    # Get the session_id and run spider2 code
    def __init__(self, session_id):
        ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # TODO How to get the session_id?
    # session_id = yield runner.crawl(Spider1) returns None
    # Or adding a return statement in Spider1 actually breaks
    # sequential processing and the program sleeps before running Spider1
    time.sleep(2)
    yield runner.crawl(Spider2(session_id))
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
I want to pass the variable to the second spider's constructor, but I can't get the data out of the first spider. If I just run the first crawler on its own to return the variable, it obviously breaks the sequential structure. If I try to retrieve the yielded result, it is None.
Am I completely missing something? I can't believe this is such a complicated task.
You can share a queue between the two spiders and have Spider2 block on queue.get(), so there is no need for time.sleep(2).
# globals.py
from queue import Queue

queue = Queue()

# run.py
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

import globals


class Spider1(scrapy.Spider):
    def __init__(self):
        # Put the session_id onto `globals.queue` somewhere in `Spider1`,
        # so that `Spider2` can start.
        ...

class Spider2(scrapy.Spider):
    def __init__(self):
        session_id = globals.queue.get()

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()
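For illustration, a minimal sketch of what "somewhere in `Spider1`" could look like, assuming the session_id becomes available inside a parse callback; the URL, spider name, and the way the id is obtained are placeholders, not part of the original answer:

class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Hypothetical: however the session_id is obtained (e.g. from a
        # webdriver this spider created), hand it over via the shared queue.
        session_id = "some-session-id"
        globals.queue.put(session_id)  # this is what unblocks Spider2's queue.get()

Because `yield runner.crawl(Spider1)` only resumes the `crawl()` generator once Spider1 has finished, the queue normally already holds the session_id by the time Spider2 is constructed; `queue.get()` would only block if Spider1 finished without putting anything on the queue.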
You can also just create the webdriver and pass it in as an argument. When I originally tried this it didn't work, because I was passing the argument incorrectly (see my comments on the post).
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from selenium import webdriver
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # Do whatever with the driver

class Spider2(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # This is the same driver as Spider1 used

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    driver = webdriver.Chrome()
    yield runner.crawl(Spider1, driver=driver)
    yield runner.crawl(Spider2, driver=driver)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
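Since the goal in the question was to run the first spider once but the second spider multiple times, here is a sketch of how that could look with this approach; the loop count of three is arbitrary, and quitting the driver after the reactor has stopped is an assumption about the desired cleanup:

driver = webdriver.Chrome()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1, driver=driver)  # run the first spider once
    for _ in range(3):                          # arbitrary number of repeat runs
        yield runner.crawl(Spider2, driver=driver)
    reactor.stop()

crawl()
reactor.run()  # blocks until all crawls above have finished
driver.quit()  # close the browser once the reactor has stopped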