Why does calling a scrapy spider from pywikibot give a ReactorNotRestartable error?
I can call a scrapy spider from another Python script using either CrawlerRunner or CrawlerProcess. But when I call that same spider-calling class from a pywikibot bot, I get a ReactorNotRestartable error. Why is that, and how can I fix it?
Here is the error:
File ".\scripts\userscripts\ReplicationWiki\RWLoad.py", line 161, in format_new_page
aea = AEAMetadata(url=DOI_url)
File ".\scripts\userscripts\ReplicationWiki\GetAEAMetadata.py", line 39, in __init__
reactor.run() # the script will block here until all crawling jobs are finished
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1282, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
ReactorBase.startRunning(self)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
CRITICAL: Exiting due to uncaught exception <class 'twisted.internet.error.ReactorNotRestartable'>
Here is the script that calls my scrapy spider. It runs fine if I just call the class from main:
from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider
class AEAMetadata:
    """
    Helper to run ScrapeAEA spider and return JEL codes and data links
    for a given AEA article link.
    """

    def __init__(self, *args, **kwargs):
        """Initializer"""
        url = kwargs.get('url')
        if not url:
            raise ValueError('No article url given')

        self.items = []
        def collect_items(item, response, spider):
            self.items.append(item)

        settings = get_project_settings()
        crawler = Crawler(ScrapeaeaSpider, settings)
        crawler.signals.connect(collect_items, signals.item_scraped)

        runner = CrawlerRunner(settings)
        d = runner.crawl(crawler, url=url)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()  # the script will block here until all crawling jobs are finished

        #process = CrawlerProcess(settings)
        #process.crawl(crawler, url=url)
        #process.start()  # the script will block here until the crawling is finished

    def get_jelcodes(self):
        jelcodes = self.items[0]['jelcodes']
        return jelcodes

def main():
    aea = AEAMetadata(url='https://doi.org/10.1257/app.20180286')
    jelcodes = aea.get_jelcodes()
    print(jelcodes)

if __name__ == '__main__':
    main()
Update: a simple test that instantiates the AEAMetadata class twice. This is the calling code that fails in my pywikibot bot:
from GetAEAMetadata import AEAMetadata
def main(*args):
    for _ in [1, 2]:
        print('Top')
        url = 'https://doi.org/10.1257/app.20170442'
        aea = AEAMetadata(url=url)
        print('After AEAMetadata')
        jelcodes = aea.get_jelcodes()
        print(jelcodes)

if __name__ == '__main__':
    main()
My call to AEAMetadata was embedded in a much larger script, which fooled me into thinking the AEAMetadata class was only instantiated once before the failure. In fact, AEAMetadata was called twice.
I also assumed the script would block after reactor.run(), because the comments in all the scrapy examples say that's the case. However, the second deferred callback is reactor.stop(), which unblocks reactor.run().
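This is easy to reproduce with plain Twisted, no Scrapy involved. A minimal sketch of what the second instantiation effectively does:

# Minimal reproduction with the stock Twisted reactor: stop it once,
# then try to run it again, as the second AEAMetadata() effectively does.
from twisted.internet import reactor, error

reactor.callLater(0, reactor.stop)  # analogous to d.addBoth(lambda _: reactor.stop())
reactor.run()                       # returns as soon as stop() fires

try:
    reactor.run()                   # second run() on the same global reactor
except error.ReactorNotRestartable:
    print('second reactor.run() raised ReactorNotRestartable')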
A more fundamental wrong assumption was that the reactor is deleted and recreated on each iteration. In fact, the reactor is instantiated and initialized when it is first imported. And it is a global object that lives as long as the underlying process and was never designed to be restarted. The extremes actually required to delete and restart a reactor are described here:
http://www.blog.pythonlibrary.org/2016/09/14/restarting-a-twisted-reactor/
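For what it's worth, one workaround I've seen suggested (not the route I took) is to isolate each crawl in its own process, so every crawl gets a fresh reactor. A rough sketch of that idea; the _run_spider/scrape helpers are my own names, and it assumes the scraped items survive pickling:

# Sketch: give each crawl a fresh reactor by running it in a child process.
# The spider/settings imports mirror the script above.
import multiprocessing

def _run_spider(url, queue):
    from scrapy import signals
    from scrapy.crawler import Crawler, CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

    items = []
    settings = get_project_settings()
    crawler = Crawler(ScrapeaeaSpider, settings)
    crawler.signals.connect(lambda item, response, spider: items.append(item),
                            signals.item_scraped)
    process = CrawlerProcess(settings)
    process.crawl(crawler, url=url)
    process.start()   # blocks until the crawl finishes; the reactor dies with the process
    queue.put(items)  # assumes the scraped items are picklable

def scrape(url):
    # On Windows, call this only under an `if __name__ == '__main__':` guard,
    # since multiprocessing spawns a fresh interpreter for the child.
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=_run_spider, args=(url, queue))
    worker.start()
    items = queue.get()
    worker.join()
    return items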
So, I think I've answered my own question. And I'm rewriting my script so it no longer tries to use the reactor in a way it was never intended to be used.
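Concretely, the shape I'm moving to is the pattern the Scrapy docs recommend for running several crawls in one process: chain the crawls as deferreds and start and stop the single reactor exactly once. A simplified sketch:

# Sketch: one reactor for the whole run; crawls are chained as deferreds
# and the reactor is started and stopped exactly once.
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

@defer.inlineCallbacks
def crawl_all(runner, urls):
    for url in urls:
        yield runner.crawl(ScrapeaeaSpider, url=url)  # run crawls sequentially
    reactor.stop()  # stop only after the last crawl has finished

runner = CrawlerRunner(get_project_settings())
crawl_all(runner, ['https://doi.org/10.1257/app.20170442',
                   'https://doi.org/10.1257/app.20180286'])
reactor.run()  # blocks until crawl_all() stops the reactor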
Also, thanks to Gallaecio, who got me thinking in the right direction.