Scrapy: yield form request prints None?
I am writing a spider to crawl a website:
The first URL, www.parenturl.com, is handled by the parse function. There I extract the URL www.childurl.com and issue a request with a callback to the parse2 function, which returns a dict.
Question 1) I need to store the dict values in a MySQL database, together with 7 other values that I extract from the parent URL in the parse function. (response_url prints None)
def parse(self, response):
    for i in range(0, 2):
        url = response.xpath('//*[@id="response"]').extract()
        response_url = yield SplashFormRequest(url, method='GET', callback=self.parse2)
        print response_url  # prints None

def parse2(self, response):
    dict = {'url': response.url}
    return dict
You cannot assign the result of a yield expression to a variable here; it behaves like a return statement, so response_url is always None. Try removing the assignment:
def parse(self, response):
    self.results = []
    for i in range(0, 2):
        url = response.xpath('//*[@id="response"]').extract()
        request = SplashFormRequest(url, method='GET', callback=self.parse2)
        yield request
    print self.results

def parse2(self, response):
    # print response here!
    dict = {'url': response.url}
    self.results.append(dict)
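Why the assignment yields None can be demonstrated with a plain generator, independent of Scrapy: an assignment from yield only receives a value if the consumer resumes the generator with .send(), and Scrapy's engine simply iterates over the callback, sending nothing back.

```python
# `received = yield value` only gets a value if the consumer calls .send();
# under plain iteration (which is how Scrapy's engine consumes a callback),
# the assignment is always None.
def gen():
    received = yield "request"
    yield received  # None under plain iteration

# Plain iteration, as Scrapy does:
values = list(gen())  # -> ["request", None]

# Only an explicit .send() delivers a value into the assignment:
g = gen()
next(g)                      # advance to the first yield
echoed = g.send("response")  # -> "response"
```

This is exactly why response_url prints None in the question's code: nothing ever sends a value back into the generator.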
Storing the results of the second callback on the spider object and then printing them is not guaranteed to work, due to Scrapy's asynchronous nature. Instead, you could try passing additional data to the callback functions, for example:
def parse(self, response):
    for i in range(0, 2):
        item = ...  # extract some information
        url = ...   # construct URL
        yield SplashFormRequest(url, callback=self.parse2, meta={'item': item})

def parse2(self, response):
    item = response.meta['item']   # get data from previous parsing method
    item.update({'key': 'value'})  # add more information
    print item  # do something with the "complete" item
    return item
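On question 1 (storing the combined values in MySQL), the usual Scrapy approach is to return the completed item from parse2, as above, and write it to the database in an item pipeline. A minimal sketch follows; the class name and the two-column schema are illustrative (not from the question), and sqlite3 from the standard library stands in for MySQL so the sketch is self-contained. With pymysql or MySQLdb the structure and SQL are the same.

```python
import sqlite3

# Illustrative pipeline sketch: Scrapy calls open_spider, process_item and
# close_spider on pipeline classes listed in ITEM_PIPELINES. sqlite3 stands
# in for MySQL here; swap the connection for pymysql to target MySQL.
class SQLStorePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE pages (parent_url TEXT, child_url TEXT)")

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO pages (parent_url, child_url) VALUES (?, ?)",
            (item.get("parent_url"), item.get("url")))
        self.conn.commit()
        return item  # pipelines must return the item for later stages

    def close_spider(self, spider):
        self.conn.close()

# Simulated use outside Scrapy, with a hypothetical merged item:
pipeline = SQLStorePipeline()
pipeline.open_spider(None)
pipeline.process_item({"parent_url": "http://www.parenturl.com",
                       "url": "http://www.childurl.com"}, None)
rows = pipeline.conn.execute("SELECT * FROM pages").fetchall()
pipeline.close_spider(None)
```

The 7 extra parent-page values from the question would travel in the same meta dict and become additional columns in the INSERT.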