Scrapy - run at time interval

Question

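I have a spider that crawls a site, and I want to run it every 10 minutes. I put it in a Python schedule job and ran it. After the first run I got

ReactorNotRestartable

I then tried a multiprocessing workaround and got

AttributeError: Can't pickle local object 'run_spider.<locals>.f'

Edit: after another attempt, the Python program runs without errors and the crawl function runs every 30 seconds, but the spider does not run and I get no data.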

from multiprocessing import Process, Queue

from scrapy import crawler

# DivarSpider is the spider class (its import is not shown here)

def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(DivarSpider)
            #deferred.addBoth(lambda _: reactor.stop())
            #reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(DivarSpider)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Answer 1

The simplest way I know is to use a separate script that calls the script containing your Twisted reactor, like this:

import subprocess

# Each call starts a fresh Python process, so the reactor is never restarted.
cmd = ['python3', 'auto_crawl.py']
subprocess.Popen(cmd).wait()

To run your CrawlerRunner every 10 minutes, you could wrap this script in a loop or schedule it with crontab.
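The loop variant could look roughly like the sketch below. It is only an illustration: `run_periodically`, `max_runs` and the injectable `sleep` parameter are names made up here, and `auto_crawl.py` is assumed to be the script that starts the reactor, as in the snippet above.

```python
import subprocess
import sys
import time


def run_periodically(cmd, interval_seconds, max_runs=None, sleep=time.sleep):
    """Run cmd in a fresh process every interval_seconds and return the exit codes.

    max_runs=None loops forever; a number stops after that many runs.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        started = time.monotonic()
        # Every run gets its own interpreter, and therefore its own reactor.
        exit_codes.append(subprocess.run(cmd).returncode)
        if max_runs is not None and len(exit_codes) >= max_runs:
            break
        # Sleep only for what is left of the interval after the crawl finished.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return exit_codes


if __name__ == "__main__":
    # Assumed script name, taken from the example above.
    run_periodically([sys.executable, "auto_crawl.py"], 10 * 60)
```

With crontab, the schedule `*/10 * * * *` in front of the command that launches the script would give the same 10-minute interval without keeping a Python process alive between runs.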

Answer 2

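The multiprocessing solution is a gross hack that works around a lack of understanding of how Scrapy and Twisted reactor management work. You can drop it, and everything becomes much simpler.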

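from twisted.internet import reactor
from twisted.internet.task import LoopingCall

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from yourlib import YourSpider

configure_logging()
runner = CrawlerRunner()
# Pass the spider class, not an instance. crawl() returns a Deferred,
# so LoopingCall waits for each crawl to finish before scheduling the next.
task = LoopingCall(lambda: runner.crawl(YourSpider))
task.start(60 * 10)  # interval in seconds
reactor.run()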
