Scrapy crawl multiple times in long running process
So I made this class so that I can crawl on demand using Scrapy:
from scrapy import signals
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.settings import Settings


class NewsCrawler(object):

    def __init__(self, spiders=[]):
        self.spiders = spiders
        self.settings = Settings()

    def crawl(self, start_date, end_date):
        crawled_items = []

        def add_item(item):
            crawled_items.append(item)

        process = CrawlerProcess(self.settings)

        for spider in self.spiders:
            crawler = Crawler(spider, self.settings)
            crawler.signals.connect(add_item, signals.item_scraped)
            process.crawl(crawler, start_date=start_date, end_date=end_date)

        process.start()

        return crawled_items
Basically, I have a long-running process in which I call the above class' crawl method multiple times, like this:
import time


crawler = NewsCrawler(spiders=[Spider1, Spider2])

while True:
    items = crawler.crawl(start_date, end_date)
    # do something with crawled items ...
    time.sleep(3600)
The problem is, the second time crawl is called, this error occurs: twisted.internet.error.ReactorNotRestartable.
From what I understand, it's because the reactor cannot be run again after it has been stopped. Is there any workaround for that?
Thanks!
This is currently a limitation of scrapy (twisted) which makes it hard to use scrapy as a library.
What you can do is spawn a new process which runs the crawler and stops the reactor when the crawl is finished. You can then wait for the join and spawn a new process after the crawl has finished. If you want to handle the items in your main thread you can post the results to a queue. I would recommend using a customized pipeline for your items though.
Take a look at my answer below:
You should be able to apply the same principles, but you would rather use multiprocessing instead of billiard.
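A minimal sketch of that approach, kept close to the question's NewsCrawler (the names _crawl, crawl_once and the _DONE sentinel are illustrative, not from the original answer, and it collects items via the item_scraped signal rather than the custom pipeline recommended above):

from multiprocessing import Process, Queue

from scrapy import signals
from scrapy.crawler import CrawlerProcess

_DONE = "__crawl_done__"  # sentinel marking the end of one crawl run


def _crawl(queue, spiders, settings, **spider_kwargs):
    # Runs in a child process, so every call gets a fresh Twisted reactor.
    def collect(item, response, spider):
        queue.put(item)  # items must be picklable to cross the process boundary

    process = CrawlerProcess(settings)
    for spider_cls in spiders:
        crawler = process.create_crawler(spider_cls)
        crawler.signals.connect(collect, signal=signals.item_scraped)
        process.crawl(crawler, **spider_kwargs)
    process.start()   # blocks until all crawls finish, then the reactor stops
    queue.put(_DONE)  # tell the parent we are finished


def crawl_once(spiders, settings, **spider_kwargs):
    queue = Queue()
    child = Process(target=_crawl, args=(queue, spiders, settings),
                    kwargs=spider_kwargs)
    child.start()
    items = []
    while True:  # drain the queue before joining so the child never blocks on it
        obj = queue.get()
        if obj == _DONE:
            break
        items.append(obj)
    child.join()
    return items

crawl_once([Spider1, Spider2], Settings(), start_date=start_date, end_date=end_date) could then replace the crawler.crawl(start_date, end_date) call inside the question's while loop, with each hourly run getting its own process and therefore its own reactor.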
Answer based on @bj-blazkowicz's answer above. I found a solution with CrawlerRunner, which is the recommended crawler to use when running multiple spiders, as stated in the docs https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
Using this class the reactor should be explicitly run after scheduling your spiders. It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
Code in your main file:
from multiprocessing import Process, Queue

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

# Enable logging for CrawlerRunner
configure_logging()


class CrawlerRunnerProcess(Process):
    def __init__(self, spider, q, *args):
        Process.__init__(self)
        self.runner = CrawlerRunner(get_project_settings())
        self.spider = spider
        self.q = q
        self.args = args

    def run(self):
        deferred = self.runner.crawl(self.spider, self.q, self.args)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run(installSignalHandlers=False)


# The wrapper to make it run multiple spiders, multiple times
def run_spider(spider, *args):  # optional arguments
    q = Queue()  # optional queue to return spider results
    runner = CrawlerRunnerProcess(spider, q, *args)
    runner.start()
    runner.join()
    return q.get()
Code in your spider file:
class MySpider(Spider):
    name = 'my_spider'

    def __init__(self, q, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.q = q  # optional queue
        self.args = args  # optional args

    def parse(self, response):
        my_item = MyItem()
        self.q.put(my_item)
        yield my_item
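For completeness, plugging run_spider() into the question's hourly loop might look roughly like the sketch below (start_date and end_date are the variables from the question; note that run_spider() as written returns a single q.get() result, so a spider that scrapes many items would need to put a list on the queue, or the caller would need to drain it):

import time

while True:
    # each call runs the crawl in a fresh child process with its own reactor,
    # so the ReactorNotRestartable error from the question no longer applies
    result = run_spider(MySpider, start_date, end_date)
    # do something with the result ...
    time.sleep(3600)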