Scrapy item extraction scope issue
I'm running into a scope issue when returning a Scrapy item (Player) to my pipeline. I'm fairly sure I know what the problem is, but I'm not sure how to integrate the fix into my code. I'm also fairly confident that the pipeline itself is now written correctly. The trouble is that I declare the Player item inside the parseRoster() function, so I know its scope is limited to that function.
My question is: where in my code do I need to declare the Player item so that it is visible to my pipeline? My goal is to get this data into my database. I assume it belongs in the main loop of my code; if that's the case, how do I return both the original item and my newly declared Player item?
My code is as follows:
class NbastatsSpider(scrapy.Spider):
    name = "nbaStats"
    start_urls = [
        "http://espn.go.com/nba/teams"  ## only this start URL; had some issues when navigating to team roster pages
    ]

    def parse(self, response):
        items = []  ## list that stores TeamStats items
        i = 0  ## counter needed for older code
        for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
            for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
                item = TeamStats()
                item['division'] = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()').extract()[0]
                item['team'] = team.xpath('.//h5/a/text()').extract()[0]
                item['rosterurl'] = "http://espn.go.com" + team.xpath('.//div/span[2]/a[3]/@href').extract()[0]
                items.append(item)
                request = scrapy.Request(item['rosterurl'], callback=self.parseWPNow)
                request.meta['play'] = item
                yield request
                print(item)

    def parseWPNow(self, response):
        item = response.meta['play']
        item = self.parseRoster(item, response)
        return item

    def parseRoster(self, item, response):
        players = Player()
        int = 0
        for player in response.xpath("//td[@class='sortcell']"):
            players['name'] = player.xpath("a/text()").extract()[0]
            players['position'] = player.xpath("following-sibling::td[1]").extract()[0]
            players['age'] = player.xpath("following-sibling::td[2]").extract()[0]
            players['height'] = player.xpath("following-sibling::td[3]").extract()[0]
            players['weight'] = player.xpath("following-sibling::td[4]").extract()[0]
            players['college'] = player.xpath("following-sibling::td[5]").extract()[0]
            players['salary'] = player.xpath("following-sibling::td[6]").extract()[0]
            yield players
        item['playerurl'] = response.xpath("//td[@class='sortcell']/a").extract()
        yield item
Quoting the relevant part of Scrapy's data flow documentation:

    The Engine sends scraped Items (returned by the Spider) to the Item
    Pipeline and Requests (returned by spider) to the Scheduler

In other words, return/yield your item instances from the spider, and they will be handed to the process_item() method of your pipeline. Since you have multiple item classes, distinguish between them using the isinstance() built-in function:
def process_item(self, item, spider):
    if isinstance(item, TeamStats):
        ...  # process team stats
    elif isinstance(item, Player):
        ...  # process player
    return item