Multi-page scrapy yielding my items too quickly to finish - functions not chaining and waiting for completion
I'm building a soccer app and trying to learn how multi-page scraping works.
For example, the first page (http://footballdatabase.com/ranking/world/1) has the two sets of links I want to scrape: club-name links and pagination links.
I want to loop through a) each page (pagination) and then b) each club, grabbing its current EU ranking.
The code I wrote works to some extent. However, I only end up with about 45 results instead of the 2000+ clubs. Note: the pagination has 45 pages, so it seems to loop through once, everything completes, and my items are yielded.
How can I chain it all together so that I end up with 2000+ results?
Here's my code:
# get pagination links
def parse(self, response):
    for href in response.css("ul.pagination > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_club)

# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):
        item = rankingItem()
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()
        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)
        request.meta['item'] = item
        return request

# get the EU ranking on each of the club pages
def parse_club_page_2(self, response):
    item = response.meta['item']
    item['eu_ranking'] = response.xpath('//a[@class="label label-default"][2]/text()').extract()
    yield item
You need to yield from the parse_club callback, not return. With return, the method exits on the first row of the loop, so each pagination page schedules only a single club request, which is why you end up with roughly one result per page (about 45 in total):
# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):
        item = rankingItem()
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()
        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)
        request.meta['item'] = item
        yield request  # FIX HERE
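As a side note (my addition, not part of the original answer): on Scrapy 1.7+ you can hand the item to the next callback with cb_kwargs instead of request.meta, which keeps meta free for Scrapy's own bookkeeping. A minimal sketch of the same handoff:

        # inside the for-loop of parse_club, replacing the meta handoff
        yield scrapy.Request(club_url,
                             callback=self.parse_club_page_2,
                             cb_kwargs={'item': item})  # item arrives as a keyword argument

    # the callback then receives the item directly
    def parse_club_page_2(self, response, item):
        item['eu_ranking'] = response.xpath('//a[@class="label label-default"][2]/text()').extract()
        yield item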
I would also simplify the element-locating part to:
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.css('td.club'):
        item = rankingItem()
        item['name'] = sel.xpath('.//div[@itemprop="itemListElement"]/text()').extract_first()
        # get more club information
        club_href = sel.xpath('.//a/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)
        request.meta['item'] = item
        yield request
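For completeness, here is how the pieces could be assembled into one self-contained spider. The spider name, start URL, and the rankingItem field definitions are assumptions filled in from the snippets above, not something the question spells out:

import scrapy

class rankingItem(scrapy.Item):
    # fields inferred from the snippets; the real item may define more
    name = scrapy.Field()
    eu_ranking = scrapy.Field()

class ClubRankingSpider(scrapy.Spider):
    name = 'club_ranking'  # assumed spider name
    start_urls = ['http://footballdatabase.com/ranking/world/1']

    def parse(self, response):
        # schedule a request for every pagination page
        for href in response.css("ul.pagination > li > a::attr('href')"):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_club)

    def parse_club(self, response):
        # schedule one club-page request per row
        for sel in response.css('td.club'):
            item = rankingItem()
            item['name'] = sel.xpath('.//div[@itemprop="itemListElement"]/text()').extract_first()
            club_url = response.urljoin(sel.xpath('.//a/@href').extract_first())
            request = scrapy.Request(club_url, callback=self.parse_club_page_2)
            request.meta['item'] = item
            yield request

    def parse_club_page_2(self, response):
        # attach the EU ranking to the item carried over in meta
        item = response.meta['item']
        item['eu_ranking'] = response.xpath('//a[@class="label label-default"][2]/text()').extract()
        yield item

Note that the same pagination URL linked from several pages is only fetched once: Scrapy's default duplicate-request filter drops repeated URLs, so the 45 pages each produce one parse_club call.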