Scrapy crawling multiple pages [3 levels] but scraped data not linking properly

Question

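I am trying to scrape three levels of data: TV show name -> season -> episodes. The problem is that I get all the episodes, but they are not linked correctly to the first two levels. For example, if season 1 has 5 episodes and season 2 has 10 episodes, the output shows season 2 with 15 episodes, and season 1 is nowhere to be found.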

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())

        # for each television show extract the url to get the seasons
        for sel in jsonresponse["tv"]:
            item = TV()
            item['tv_name'] = sel['title']
            item['tv_url'] = sel['url']

            request = Request(item['tv_url'], callback = self.parse_season_details)
            request.meta['item'] = item
            yield request


    def parse_season_details(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']

        for sel in jsonresponse["seasons"]:
            item['tv_season'] = sel['season_no']
            item['tv_season_url'] = sel['season_url']

            request = Request(item['tv_season_url'], callback = self.parse_episode_details)
            request.meta['item'] = item
            yield request

    # okay, I found my tv show and extracted the number of seasons; now I'm going into each season to get the episode details

    def parse_episode_details(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']

        for sel in jsonresponse["episodes"]:
            item['tv_episode_number'] = sel["ep"]
            item['tv_episode_name'] = sel['name']
            item['tv_episode_description'] = sel['description']
            yield item

However, the output I get looks like this (assuming season 1 has 2 episodes and season 2 has 3 episodes):

tvshow season2 ep1
tvshow season2 ep2
tvshow season2 ep1
tvshow season2 ep2
tvshow season2 ep3

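I did some debugging, and it seems to run level by level. At level 2, season 2 is the last one scraped and overwrites season 1; that value is then passed on to the third level, which goes on to extract the episode details.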
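The overwrite described here can be reproduced without Scrapy. The following is only a minimal sketch: a plain dict stands in for the TV() item and a list stands in for the queue of scheduled requests (both are assumptions made for illustration, not Scrapy objects). Every queued entry holds a reference to the same object, so by the time the callbacks would run, they all see the last season written.

```python
# A plain dict stands in for the shared TV() item (assumption: it is a mutable mapping).
item = {"tv_name": "tvshow"}

# A list stands in for the queue of scheduled requests (hypothetical, not Scrapy's scheduler).
pending = []
for season_no in (1, 2):
    item["tv_season"] = season_no     # mutates the one shared object
    pending.append({"item": item})    # every "request" keeps the same reference

# The callbacks only run after the loop above has finished,
# so each of them reads whatever was written last.
seen = [meta["item"]["tv_season"] for meta in pending]
print(seen)  # [2, 2]
```

Replacing `pending.append({"item": item})` with `pending.append({"item": dict(item)})` gives each entry its own copy, and `seen` becomes `[1, 2]`.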

If anyone has any ideas on how to fix this, it would be much appreciated!

Thanks

Answer 1

You should instantiate a new item every time you yield one. Assuming TV() is an Item class:

class TV(Item):
    ....

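you should create a separate item = TV() for each episode.

If you want to pass data down from the top levels, pass the data itself and create the item only when you actually yield it: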

    def parse(self, response):
        # newer Scrapy versions expose the same decoded string as response.text
        jsonresponse = json.loads(response.body_as_unicode())

        # for each television show, request its url to get the seasons
        for sel in jsonresponse["tv"]:
            # carry plain values in meta instead of a shared item
            request = Request(sel['url'], callback=self.parse_season_details)
            request.meta['tv_name'] = sel['title']
            request.meta['tv_url'] = sel['url']
            yield request
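The snippet above covers only the first level. As a rough sketch of how the same idea could carry through all three callbacks, here is a version that runs standalone, without Scrapy. Everything Scrapy-like in it is a stand-in invented for illustration: `PAGES` is a made-up in-memory site, a `(url, callback, meta)` tuple plays the role of a Request with its meta, the callbacks take `(body, meta)` instead of a response, and `crawl` is a toy FIFO loop rather than the real engine.

```python
import json
from collections import deque

# Hypothetical in-memory "site" using the field names from the question.
PAGES = {
    "/tv": json.dumps({"tv": [{"title": "tvshow", "url": "/tvshow"}]}),
    "/tvshow": json.dumps({"seasons": [
        {"season_no": 1, "season_url": "/tvshow/s1"},
        {"season_no": 2, "season_url": "/tvshow/s2"},
    ]}),
    "/tvshow/s1": json.dumps({"episodes": [
        {"ep": 1, "name": "s1e1", "description": ""},
        {"ep": 2, "name": "s1e2", "description": ""},
    ]}),
    "/tvshow/s2": json.dumps({"episodes": [
        {"ep": 1, "name": "s2e1", "description": ""},
        {"ep": 2, "name": "s2e2", "description": ""},
        {"ep": 3, "name": "s2e3", "description": ""},
    ]}),
}


def parse(body, meta):
    for sel in json.loads(body)["tv"]:
        # Pass plain values down, not a shared item.
        yield sel["url"], parse_season_details, {
            "tv_name": sel["title"],
            "tv_url": sel["url"],
        }


def parse_season_details(body, meta):
    for sel in json.loads(body)["seasons"]:
        # Build a fresh dict per season so no two requests share state.
        season_meta = dict(meta,
                           tv_season=sel["season_no"],
                           tv_season_url=sel["season_url"])
        yield sel["season_url"], parse_episode_details, season_meta


def parse_episode_details(body, meta):
    for sel in json.loads(body)["episodes"]:
        # The item is created only here, one new object per episode.
        yield dict(meta,
                   tv_episode_number=sel["ep"],
                   tv_episode_name=sel["name"],
                   tv_episode_description=sel["description"])


def crawl(start_url):
    """Toy FIFO driver: tuples are follow-up requests, dicts are scraped items."""
    queue = deque([(start_url, parse, {})])
    items = []
    while queue:
        url, callback, meta = queue.popleft()
        for result in callback(PAGES[url], meta):
            if isinstance(result, tuple):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("/tv")
pairs = [(i["tv_season"], i["tv_episode_number"]) for i in items]
print(pairs)  # [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
```

In the actual spider, each tuple would be a Request whose meta carries those values, and the final dict would be a new TV() item filled in inside parse_episode_details.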

Tags: python, arrays, web-crawler, scrapy