Multi-page scrapy yielding my items too quickly to finish - functions not chaining and waiting for completion

I'm building a football app and trying to understand how multi-page scraping works.

For example, the first page (http://footballdatabase.com/ranking/world/1) has two sets of links I want to scrape: the club-name links and the pagination links.

I want to loop through a) each pagination page, and then b) each club on it, grabbing the club's current EU ranking.

The code I wrote sort of works. However, I end up with only about 45 results instead of the 2000+ clubs. Note: there are 45 pagination pages, so it looks like one item comes through per loop, everything finishes, and my items are yielded.

How can I chain everything together so that I end up with the 2000+ results?

Here is my code:

# get Pagination links
def parse(self, response):
    for href in response.css("ul.pagination > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_club)

# get club links on each of the pagination pages
def parse_club(self, response):


    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):

        item = rankingItem()

            item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()

            # get more club information
            club_href = sel.xpath('td[2]/a[1]/@href').extract_first()  
            club_url = response.urljoin(club_href) 
            request = scrapy.Request(club_url,callback=self.parse_club_page_2)

            request.meta['item'] = item
            return request

# get the EU ranking on each of the club pages
def parse_club_page_2(self, response):

    item = response.meta['item']
    item['eu_ranking'] = response.xpath('//a[@class="label label-default"][2]/text()').extract() 

    yield item

You need to yield from the parse_club callback - not return. With return, parse_club exits on the first row of each page, so each of the 45 pagination pages produces exactly one request - hence your ~45 results:

# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):    
        item = rankingItem()    
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()

        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()  
        club_url = response.urljoin(club_href) 
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        yield request  # FIX HERE
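
The difference is plain Python rather than anything Scrapy-specific: return exits the function on the first pass through the loop, while yield hands back a value and keeps iterating. A minimal illustration (the helper names are made up for this example):

def first_row_only(rows):
    for row in rows:
        return row   # leaves the function immediately - one value per call

def every_row(rows):
    for row in rows:
        yield row    # produces a value, then continues the loop

print(first_row_only([1, 2, 3]))      # 1
print(list(every_row([1, 2, 3])))     # [1, 2, 3]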

I also simplified the element-locating part to:

def parse_club(self, response):
    # loop through each of the rows
    for sel in response.css('td.club'):
        item = rankingItem()
        item['name'] = sel.xpath('.//div[@itemprop="itemListElement"]/text()').extract_first()

        # get more club information
        club_href = sel.xpath('.//a/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        yield request
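
For reference, here is a sketch of how the whole spider might fit together once the fix is in. The callbacks and selectors are the ones from above; the item definition, spider name and start URL are my assumptions:

import scrapy

class rankingItem(scrapy.Item):
    # assumed definition - just the two fields used above
    name = scrapy.Field()
    eu_ranking = scrapy.Field()

class ClubRankingSpider(scrapy.Spider):
    name = 'club_ranking'  # assumed spider name
    start_urls = ['http://footballdatabase.com/ranking/world/1']

    # get pagination links
    def parse(self, response):
        for href in response.css("ul.pagination > li > a::attr('href')"):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_club)

    # get club links on each of the pagination pages
    def parse_club(self, response):
        for sel in response.css('td.club'):
            item = rankingItem()
            item['name'] = sel.xpath('.//div[@itemprop="itemListElement"]/text()').extract_first()

            club_href = sel.xpath('.//a/@href').extract_first()
            request = scrapy.Request(response.urljoin(club_href),
                                     callback=self.parse_club_page_2)
            request.meta['item'] = item
            yield request  # yield, so every club on the page gets followed

    # get the EU ranking on each of the club pages
    def parse_club_page_2(self, response):
        item = response.meta['item']
        item['eu_ranking'] = response.xpath(
            '//a[@class="label label-default"][2]/text()').extract()
        yield item

On Scrapy 1.7+, cb_kwargs is a cleaner alternative to request.meta for passing the item along, e.g. scrapy.Request(url, callback=self.parse_club_page_2, cb_kwargs={'item': item}) paired with the signature def parse_club_page_2(self, response, item):.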