Scrapy crawling multiple pages [3 levels] but scraped data not linking properly
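I'm trying to scrape three levels of data: TV show name -> season -> episode. The problem is that I get all the episodes, but they are not linked to the first two levels correctly. For example, if season 1 has 5 episodes and season 2 has 10, my output shows season 2 with 15 episodes and season 1 is nowhere to be found.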
def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    #for each television show extract the url to get the seasons
    for sel in jsonresponse["tv"]:
        item = TV()
        item['tv_name'] = sel['title']
        item['tv_url'] = sel['url']
        request = Request(item['tv_url'], callback=self.parse_season_details)
        request.meta['item'] = item
        yield request

def parse_season_details(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    item = response.meta['item']
    for sel in jsonresponse["seasons"]:
        item['tv_season'] = sel['season_no']
        item['tv_season_url'] = sel['season_url']
        request = Request(item['tv_season_url'], callback=self.parse_episode_details)
        request.meta['item'] = item
        yield request

#okay I found my tv show, extracted number of seasons, now I'm going into each season to get the episode details
def parse_episode_details(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    item = response.meta['item']
    for sel in jsonresponse["episodes"]:
        item['tv_episode_number'] = sel["ep"]
        item['tv_episode_name'] = sel['name']
        item['tv_episode_description'] = sel['description']
        yield item
However, the output I get looks like this (assuming season 1 has 2 episodes and season 2 has 3):
- tvshow season2 ep1
- tvshow season2 ep2
- tvshow season2 ep1
- tvshow season2 ep2
- tvshow season2 ep3
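After some debugging, it looks like things run in sequence: at level 2, season 2 is the last one scraped and overwrites season 1, and that overwritten item is what gets passed on to the third level to extract the episode details.
If anyone has any ideas on how to solve this, it would be greatly appreciated!
Thanks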
You should instantiate a new item each time you yield one.
Assuming TV is an Item class:
class TV(Item):
    ....
you should create a separate item = TV() for each episode.
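To see why reusing one item produces "season2" everywhere, here is a small Scrapy-free sketch. It assumes a plain dict stands in for the TV item, and the helper names queue_shared and queue_fresh are made up for illustration. The loop finishes before any queued entry is processed, so every entry that points at the same object shows the last season written to it.

```python
def queue_shared(item, seasons):
    """Mimic the original parse_season_details: mutate one item, queue it each time."""
    queued = []
    for season_no in seasons:
        item["tv_season"] = season_no
        queued.append(item)  # the same object is queued on every pass
    return queued


def queue_fresh(item, seasons):
    """Same loop, but each queued entry is its own copy."""
    queued = []
    for season_no in seasons:
        fresh = dict(item)  # new object per season
        fresh["tv_season"] = season_no
        queued.append(fresh)
    return queued


seasons_shared = [e["tv_season"] for e in queue_shared({"tv_name": "tvshow"}, [1, 2])]
seasons_fresh = [e["tv_season"] for e in queue_fresh({"tv_name": "tvshow"}, [1, 2])]
print(seasons_shared)  # [2, 2]
print(seasons_fresh)   # [1, 2]
```

The same thing happens with request.meta['item'] = item: every season request carries a reference to one item, not a snapshot of it.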
If you want to pass data down from the top level, pass the data itself and create an item only when you actually yield it:
def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    #for each television show extract the url to get the seasons
    for sel in jsonresponse["tv"]:
        request = Request(sel['url'], callback=self.parse_season_details)
        request.meta['tv_name'] = sel['title']
        request.meta['tv_url'] = sel['url']
        yield request
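The other two callbacks can follow the same pattern. Below is a rough sketch of that flow that runs without Scrapy, so the idea can be checked in isolation. FAKE_SITE, crawl and the tuple-based "requests" are all made up for illustration. In a real spider, each yielded tuple would be a Request with the values stored in request.meta, and the final dict would be a TV() item.

```python
# Hypothetical stand-in for the JSON the three levels would return.
FAKE_SITE = {
    "show": {"seasons": [{"season_no": 1, "season_url": "s1"},
                         {"season_no": 2, "season_url": "s2"}]},
    "s1": {"episodes": [{"ep": 1, "name": "a", "description": ""},
                        {"ep": 2, "name": "b", "description": ""}]},
    "s2": {"episodes": [{"ep": 1, "name": "c", "description": ""},
                        {"ep": 2, "name": "d", "description": ""},
                        {"ep": 3, "name": "e", "description": ""}]},
}


def parse_season_details(data, meta):
    for sel in data["seasons"]:
        # New meta dict per season, so a later season cannot overwrite an earlier one.
        season_meta = dict(meta, tv_season=sel["season_no"],
                           tv_season_url=sel["season_url"])
        yield sel["season_url"], parse_episode_details, season_meta


def parse_episode_details(data, meta):
    for sel in data["episodes"]:
        item = dict(meta)  # with Scrapy: item = TV(), then copy the fields in
        item["tv_episode_number"] = sel["ep"]
        item["tv_episode_name"] = sel["name"]
        item["tv_episode_description"] = sel["description"]
        yield item


def crawl(url, callback, meta):
    """Tiny first-in, first-out driver standing in for Scrapy's scheduler."""
    pending, items = [(url, callback, meta)], []
    while pending:
        next_url, next_callback, next_meta = pending.pop(0)
        for result in next_callback(FAKE_SITE[next_url], next_meta):
            # Tuples are follow-up "requests"; dicts are finished items.
            (pending if isinstance(result, tuple) else items).append(result)
    return items


items = crawl("show", parse_season_details, {"tv_name": "tvshow", "tv_url": "show"})
labels = [(i["tv_season"], i["tv_episode_number"]) for i in items]
print(labels)  # [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
```

Each episode ends up as its own object carrying the correct show and season values, because nothing mutable is shared between requests.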