How to make a scrapy spider run multiple times from a tornado request

I have a Scrapy spider that needs to run whenever a Tornado GET request comes in. The first time the Tornado endpoint is called, the spider runs fine, but on the second request the spider does not run and the following error is raised:

Traceback (most recent call last):
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/tornado/web.py", line 1413, in _execute
        result = method(*self.path_args, **self.path_kwargs)
    File "server.py", line 38, in get
        process.start()
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
        reactor.run(installSignalHandlers=False)  # blocking call
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
        self.startRunning(installSignalHandlers=installSignalHandlers)
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
        ReactorBase.startRunning(self)
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
        raise error.ReactorNotRestartable()
ReactorNotRestartable

The Tornado handler is:

import json

import tornado.web
from scrapy.crawler import CrawlerProcess


class PageHandler(tornado.web.RequestHandler):

    def get(self):

        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        process.crawl(YourSpider)
        # start() runs the Twisted reactor and blocks until the crawl finishes;
        # the reactor cannot be started a second time, hence ReactorNotRestartable
        process.start()

        self.write(json.dumps(results))

So the idea is that every time the PageHandler get method is called, the spider runs and performs its crawl.

Well, after a lot of googling I finally found the answer to this problem... There is a library, scrapydo (https://github.com/darkrho/scrapydo), based on crochet, which handles blocking on the reactor for you and lets you reuse the same spider as many times as needed.
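
For context, this is roughly the pattern scrapydo builds on, in case you want to do it by hand: crochet runs the Twisted reactor in a dedicated thread, and a blocking wrapper waits on the Deferred returned by Scrapy's CrawlerRunner. A minimal sketch (the run_crawl helper name is my own, not part of either library; YourSpider and settings are the same placeholders as above):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # start the Twisted reactor in a background thread; call once per process

@wait_for(timeout=60.0)
def run_crawl(settings):
    # CrawlerRunner.crawl() returns a Deferred; @wait_for blocks the calling
    # thread until it fires, without ever stopping the shared reactor, so
    # run_crawl(settings) can be called from the handler on every request.
    runner = CrawlerRunner(settings)
    return runner.crawl(YourSpider)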

So to solve the problem you need to install the library, call the setup method once, and then use the run_spider method... The code looks like this:

import json

import scrapydo
import tornado.web

scrapydo.setup()  # call once, before handling any requests


class PageHandler(tornado.web.RequestHandler):

    def get(self):

        # run_spider blocks until the crawl finishes and can be called
        # on every request without hitting ReactorNotRestartable
        scrapydo.run_spider(YourSpider(), settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        self.write(json.dumps(results))
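
Both snippets assume a module-level results list that ResultsPipeline fills as items are scraped; neither is shown above, so here is one plausible sketch of that glue plus the Tornado app wiring (resetting the list between requests is left out for brevity):

import tornado.ioloop

results = []

class ResultsPipeline(object):
    # Hypothetical pipeline matching the '__main__.ResultsPipeline' setting
    # above: it collects every scraped item into the shared results list.
    def process_item(self, item, spider):
        results.append(dict(item))
        return item

if __name__ == '__main__':
    application = tornado.web.Application([(r'/', PageHandler)])
    application.listen(8888)
    tornado.ioloop.IOLoop.current().start()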

I hope this helps anyone who runs into the same problem!