Scrapy - idle signal spider running into an error
I am trying to create a spider that runs continuously: whenever it reaches the idle state, it should fetch the next URL to parse from the database.
Unfortunately, I am stuck right at the start:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals
from scrapy import Spider
import logging


class SignalspiderSpider(Spider):
    name = 'signalspider'
    allowed_domains = ['domain.de']
    yet = False

    def start_requests(self):
        logging.log(logging.INFO, "______ Loading requests")
        yield scrapy.Request('https://www.domain.de/product1.html')

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        logging.log(logging.INFO, "______ From Crawler")
        spider = spider = super(SignalspiderSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
        return spider

    def parse(self, response):
        self.logger.info("______ Finished extracting structured data from HTML")
        pass

    def idle(self):
        logging.log(logging.INFO, "_______ Idle state")
        if not self.yet:
            self.crawler.engine.crawl(self.create_request(), self)
            self.yet = True

    def create_request(self):
        logging.log(logging.INFO, "_____________ Create requests")
        yield scrapy.Request('https://www.domain.de/product2.html?dvar_82_color=blau&cgid=')
And this is the error I get:
2019-03-27 21:41:38 [root] INFO: _______ Idle state
2019-03-27 21:41:38 [root] INFO: _____________ Create requests
2019-03-27 21:41:38 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RefererMiddleware.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x7f93bcc13978>>
Traceback (most recent call last):
  File "/home/spidy/Documents/spo/lib/python3.5/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/home/spidy/Documents/spo/lib/python3.5/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/spidy/Documents/spo/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'NoneType' object has no attribute 'meta'
What am I doing wrong?
Try:
def idle(self, spider):
    logging.log(logging.INFO, "_______ Idle state")
    if not self.yet:
        self.yet = True
        # Build a single Request object; the spider module imports scrapy, not a bare Request name.
        request = scrapy.Request(url='https://www.domain.de/product2.html?dvar_82_color=blau&cgid=', callback=spider.parse)
        self.crawler.engine.crawl(request, spider)
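Since the question wants the next URL to come from a database rather than being hard-coded, the lookup could live in a small helper that the idle handler calls. The following is only a sketch under assumptions: the `pending_urls` table, its `visited` column, and the `pop_next_url` helper are hypothetical names, and SQLite stands in for whatever database is actually used.

```python
import sqlite3


def pop_next_url(conn):
    """Return the oldest unvisited URL and mark it visited, or None if none remain.

    Hypothetical helper: the table and column names are assumptions.
    """
    row = conn.execute(
        "SELECT id, url FROM pending_urls WHERE visited = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    url_id, url = row
    conn.execute("UPDATE pending_urls SET visited = 1 WHERE id = ?", (url_id,))
    conn.commit()
    return url


# In-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pending_urls ("
    "id INTEGER PRIMARY KEY, url TEXT NOT NULL, visited INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO pending_urls (url) VALUES (?)",
    [
        ("https://www.domain.de/product1.html",),
        ("https://www.domain.de/product2.html?dvar_82_color=blau&cgid=",),
    ],
)
conn.commit()

first = pop_next_url(conn)
second = pop_next_url(conn)
third = pop_next_url(conn)  # nothing left, so this is None
```

In the handler above, the returned URL would take the place of the hard-coded one, and a `None` result would mean there is nothing to schedule.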
I am not sure it is correct to create the request inside the spider_idle handler by passing in another method that yields the request, the way you did.
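One concrete difference from the question's code: `create_request` contains a `yield`, so calling it returns a generator object rather than a request. The sketch below shows that distinction without Scrapy. `FakeRequest` and `schedule` are hypothetical stand-ins, not part of Scrapy's API, and the sketch does not try to reproduce the exact traceback from the question.

```python
import types
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request object; this is not scrapy.Request."""
    url: str
    meta: dict = field(default_factory=dict)


URL = "https://www.domain.de/product2.html?dvar_82_color=blau&cgid="


def create_request_generator():
    # Contains `yield`, so calling it returns a generator object.
    yield FakeRequest(URL)


def create_request_plain():
    # Uses `return`, so calling it returns the request itself.
    return FakeRequest(URL)


def schedule(request):
    """Hypothetical consumer that reads request.meta, as request handlers commonly do."""
    return request.meta.get("redirect_urls", [])


gen = create_request_generator()
plain = create_request_plain()

is_generator = isinstance(gen, types.GeneratorType)  # the call produced a generator
has_meta = hasattr(gen, "meta")                      # a generator has no .meta attribute
redirects = schedule(plain)                          # a real request object works
first = next(gen)                                    # the request only appears on iteration
```

Returning the request from the helper, or building it inline as in the suggested handler, hands the consumer an object that actually has a `meta` attribute.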