scrapy crawling multiple pages [3 levels] but scraped data not linking properly

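I am trying to scrape three levels of data: TV show name -> season -> episode. The problem is that I get all the episodes, but they are not linked correctly to the first two levels. For example, if season 1 has 5 episodes and season 2 has 10 episodes, the output shows season 2 with 15 episodes, and season 1 is nowhere to be found.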

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())

        # for each television show, extract the url to get the seasons
        for sel in jsonresponse["tv"]:
            item = TV()
            item['tv_name'] = sel['title']
            item['tv_url'] = sel['url']

            request = Request(item['tv_url'], callback = self.parse_season_details)
            request.meta['item'] = item
            yield request


    def parse_season_details(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']

        for sel in jsonresponse["seasons"]:
            item['tv_season'] = sel['season_no']
            item['tv_season_url'] = sel['season_url']

            request = Request(item['tv_season_url'], callback = self.parse_episode_details)
            request.meta['item'] = item
            yield request

    # okay, I found my tv show and extracted the number of seasons; now I'm going into each season to get the episode details

    def parse_episode_details(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']

        for sel in jsonresponse["episodes"]:
            item['tv_episode_number'] = sel["ep"]
            item['tv_episode_name'] = sel['name']
            item['tv_episode_description'] = sel['description']
            yield item

However, the output I get looks like this (assuming season 1 has 2 episodes and season 2 has 3 episodes):

  1. tvshow season2 ep1
  2. tvshow season2 ep2
  3. tvshow season2 ep1
  4. tvshow season2 ep2
  5. tvshow season2 ep3

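I did some debugging, and the callbacks seem to run level by level. At the second level, season2 is the last season scraped, so it overwrites season1 on the shared item, and that item is what gets passed to the third level, which then extracts the episode details.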

If anyone has any ideas on how to fix this, it would be greatly appreciated!

Thanks

You should instantiate a new item every time you yield one. Assuming TV() is an Item class:

class TV(Item):
    ....

you should create a separate item = TV() for each episode.
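The overwriting described in the question can be reproduced without Scrapy. The following is only a minimal sketch: plain dicts stand in for the TV item, a list stands in for the queue of pending requests, and the function names schedule_shared and schedule_fresh are made up for illustration.

```python
def schedule_shared(seasons):
    """Mimic the question's loop: one item object is mutated and reused."""
    item = {"tv_name": "tvshow"}
    pending = []
    for season_no in seasons:
        item["tv_season"] = season_no   # overwrites the previous season
        pending.append(item)            # every entry refers to the SAME dict
    return pending


def schedule_fresh(seasons):
    """Create a new object per season, so nothing is overwritten."""
    base = {"tv_name": "tvshow"}
    pending = []
    for season_no in seasons:
        item = dict(base)               # shallow copy: a separate object
        item["tv_season"] = season_no
        pending.append(item)
    return pending


shared = [entry["tv_season"] for entry in schedule_shared([1, 2])]
fresh = [entry["tv_season"] for entry in schedule_fresh([1, 2])]
print(shared)  # [2, 2]
print(fresh)   # [1, 2]
```

By the time the queued entries are processed, the shared version holds only the last season, which matches the "everything is season2" output in the question.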

If you want to pass data down from the top level, pass the data itself and create an item only when you actually yield it:

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())

        # for each television show, extract the url to get the seasons
        for sel in jsonresponse["tv"]:
            # pass plain values in meta instead of a shared item object
            request = Request(sel['url'], callback = self.parse_season_details)
            request.meta['tv_name'] = sel['title']
            request.meta['tv_url'] = sel['url']
            yield request
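
The two remaining callbacks could follow the same pattern. The sketch below keeps the logic in plain functions so that it runs without Scrapy; the helper names season_meta and episode_items are hypothetical, the URLs and episode data are invented sample values, and the field names are the ones used in the question.

```python
def season_meta(show_meta, season):
    """Build a fresh meta dict for the request of one season."""
    return {
        "tv_name": show_meta["tv_name"],
        "tv_url": show_meta["tv_url"],
        "tv_season": season["season_no"],
        "tv_season_url": season["season_url"],
    }


def episode_items(meta, episodes):
    """Yield one new dict per episode, built only at yield time."""
    for ep in episodes:
        yield {
            "tv_name": meta["tv_name"],
            "tv_url": meta["tv_url"],
            "tv_season": meta["tv_season"],
            "tv_season_url": meta["tv_season_url"],
            "tv_episode_number": ep["ep"],
            "tv_episode_name": ep["name"],
            "tv_episode_description": ep["description"],
        }


# Invented sample data: season 1 has 2 episodes, season 2 has 3.
show = {"tv_name": "tvshow", "tv_url": "http://example.com/tvshow"}
seasons = [
    {"season_no": 1, "season_url": "http://example.com/tvshow/1"},
    {"season_no": 2, "season_url": "http://example.com/tvshow/2"},
]
episodes = {
    1: [{"ep": n, "name": "s1e%d" % n, "description": ""} for n in (1, 2)],
    2: [{"ep": n, "name": "s2e%d" % n, "description": ""} for n in (1, 2, 3)],
}

metas = [season_meta(show, s) for s in seasons]
items = [it for m in metas for it in episode_items(m, episodes[m["tv_season"]])]
labels = [(it["tv_season"], it["tv_episode_number"]) for it in items]
print(labels)  # [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
```

In the spider, parse_season_details would attach the result of season_meta(response.meta, sel) to each new Request as its meta, and parse_episode_details would yield the results of episode_items(response.meta, jsonresponse["episodes"]), wrapping each dict in TV(...) if an Item is required.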