AWS Lambda, Scrapy and catching exceptions
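I'm running Scrapy as an AWS Lambda function. Inside my function I need a timer to check whether it has been running for longer than 1 minute, and if it has, I need to run some logic. Here is my code: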
def handler():
    x = 60
    watchdog = Watchdog(x)
    try:
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
    except Watchdog:
        print('Timeout error: process takes longer than %s seconds.' % x)
        # some other logic here
    watchdog.stop()
I took the Watchdog timer class from this answer. The problem is that the code never hits the except Watchdog block; instead the exception is thrown outside of it:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
    raise self
functions.python.my_scrapy.index.Watchdog: 1
I need to catch the exception inside the function. How can I do that?
PS: I'm very new to Python.
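OK, this question drove me a little crazy. Here is why it doesn't work: what the Watchdog object does is create another thread. The exception is raised in that thread and never handled there, because the try/except in the handler only covers code running in the main thread. Fortunately, Twisted has some neat features.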
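To see the effect in isolation, here is a minimal sketch that uses only the standard library. The Timeout class and raise_timeout function are hypothetical stand-ins for the Watchdog from the question, not its actual code. The timer thread prints its own traceback, and the except clause in the main thread is never entered:

```python
import threading
import time


class Timeout(Exception):
    """Hypothetical stand-in for the Watchdog exception from the question."""


def raise_timeout():
    # This runs in the timer thread, not in the thread that called start().
    raise Timeout("raised in the timer thread")


caught_in_main = False
timer = threading.Timer(0.1, raise_timeout)
try:
    timer.start()
    time.sleep(0.5)  # stands in for the blocking reactor.run() call
except Timeout:
    # Never reached: the exception belongs to the timer thread's stack.
    caught_in_main = True
timer.join()

print("caught in main thread:", caught_in_main)
```

The traceback written to stderr has the same shape as the one in the question: it comes from the threading module rather than from the handler.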
You can run the reactor in another thread:
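import time
from threading import Thread

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

# MySpider1 and MySpider2 are the spider classes from the question
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
# Run the reactor in a different thread so it doesn't block the script here.
# False is installSignalHandlers: signal handlers can only be installed from the main thread.
Thread(target=reactor.run, args=(False,)).start()
time.sleep(60)  # block the script here for 60 seconds
# Now check whether it's still scraping
if reactor.running:
    pass  # do something
else:
    pass  # do something else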
I'm using Python 3.7.0.
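One possible refinement, offered as a sketch rather than as part of the answer above: time.sleep(60) always waits the full minute, even if the crawl finishes early. Joining the worker thread with a timeout returns as soon as the work is done. The helper name run_with_deadline is hypothetical, and the sleeping lambdas below only stand in for reactor.run so the example runs without Scrapy:

```python
import threading
import time


def run_with_deadline(target, deadline):
    """Run target in a worker thread and wait at most `deadline` seconds.

    Returns True if target finished in time, False if it is still running.
    """
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout=deadline)  # returns early if the worker finishes first
    return not worker.is_alive()


# Stand-ins for the crawl: one finishes in time, one does not.
fast = run_with_deadline(lambda: time.sleep(0.05), 1.0)
slow = run_with_deadline(lambda: time.sleep(1.0), 0.1)
print("fast finished in time:", fast)
print("slow finished in time:", slow)
```

With the reactor from the answer, the call would presumably look like run_with_deadline(lambda: reactor.run(False), 60), where a False result means the crawl is still going after 60 seconds.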
Twisted has scheduling primitives. For example, this program runs for about 60 seconds:
from twisted.internet import reactor
reactor.callLater(60, reactor.stop)
reactor.run()
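Building on callLater, the timeout from the question could be handled without a second thread or an exception. The following is a sketch under assumptions: stop_after_timeout and on_timeout are hypothetical names, reactor is expected to be the Twisted reactor, and crawl_deferred is expected to be the Deferred returned by runner.join(). It relies only on callLater, on the active() and cancel() methods of the delayed call it returns, and on addBoth:

```python
def stop_after_timeout(reactor, crawl_deferred, timeout, on_timeout):
    """Stop the reactor when the crawl ends or after `timeout` seconds.

    Returns a dict whose "timed_out" entry tells the caller which happened.
    """
    state = {"timed_out": False}

    def timed_out():
        # Runs inside the reactor loop, so a plain call replaces the exception.
        state["timed_out"] = True
        on_timeout()
        reactor.stop()

    delayed_call = reactor.callLater(timeout, timed_out)

    def finished(result):
        # The crawl ended first: cancel the pending timeout and stop the reactor.
        # If the timeout already fired, the reactor is stopping and nothing is left to do.
        if delayed_call.active():
            delayed_call.cancel()
            reactor.stop()
        return result

    crawl_deferred.addBoth(finished)
    return state


# Possible use inside the handler (names taken from the question):
#     runner = CrawlerRunner()
#     runner.crawl(MySpider1)
#     runner.crawl(MySpider2)
#     state = stop_after_timeout(reactor, runner.join(), 60, some_other_logic)
#     reactor.run()
#     if state["timed_out"]:
#         print('Timeout error: process takes longer than 60 seconds.')
```

Because the timeout callback runs in the same thread as the reactor, the handler can inspect state["timed_out"] after reactor.run() returns instead of trying to catch an exception from another thread.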